MEMORY MANAGEMENT

Objectives of this chapter 


• Give an overview of physical and virtual memory 

• Describe the different structures associated with virtual memory and 
explain their purposes 

• Explain how memory is mapped from physical to virtual and vice 
versa. 

• Explain how pages of memory are made and kept available for
process/thread execution.

• Describe how swap space is managed

• Describe how process structures are set up in memory

• Understand how and why memory pages are allocated, freed up, and 
recovered 

OVERVIEW OF PHYSICAL AND
VIRTUAL MEMORY

The memory management system is designed to make memory resources
available safely and efficiently among threads and processes:

• It provides a complete address space for each process, protected from
all other processes.

• It enables program size to be larger than physical memory. 

• It decides which threads and processes reside in physical memory and 
manipulates threads and processes in and out of memory. 

• It manages the parts of the virtual address space of a thread or 
process not in physical memory and determines what portions of the 
address space should reside in physical memory. 

• It allows efficient sharing of memory between processes. 

The data and instructions of any process (a program in execution) or
thread of execution within a process must be available to the CPU by
residing in physical memory at the time of execution.

To execute a process, the kernel creates a per-process virtual address
space; portions of the virtual space are mapped onto physical memory.
Virtual memory allows the total size of user processes to exceed
physical memory. Through "demand paging", HP-UX enables you to execute
threads and processes by bringing virtual pages into main memory only
as needed (that is, "on demand") and pushing out portions of a
process's address space that have not been recently used.

The term "memory management" refers to the rules that govern physical 
and virtual memory and allow for efficient sharing of the system's 
resources by user and system processes. 

The system uses a combination of pageout and deactivation to manage
physical memory. Paging involves writing recently unreferenced pages
from main memory to disk from time to time. A page is the smallest unit
of physical memory that can be mapped to a virtual address with a given
set of access attributes. On a loaded system, total unreferenced pages
might be a large fraction of memory.

Deactivation takes place if the system is unable to maintain a large
enough free pool of physical memory. When an entire process is
deactivated, the pages associated with the process can be written out to
secondary storage, since they are no longer referenced. A deactivated
process cannot run and therefore cannot reference its data.

Secondary storage supplements physical memory. The memory 
management system monitors available memory and, when it is low, 
writes out pages of a process or thread to a secondary storage device 
called a swap device. The data is read from the swap device back into 
physical memory when it is needed for the process to execute. 

Pages 

A page is the smallest contiguous block of physical memory that can be
allocated for storing data and code. Pages are also the smallest unit of
memory protection. The page size of all HP-UX systems is four kilobytes.

On a PA-RISC system, every page of physical memory is addressed by a
physical page number (PPN), which is a software "reduction" of the
physical page number from the physical address. Access to pages (and
thus to the data they contain) is done through virtual addresses, except
under specific circumstances (see footnote 1).

Virtual Addresses 

When a program is compiled, the compiler generates virtual addresses 
for the code. Virtual addresses represent a location in memory. These 
virtual addresses must be mapped to physical addresses (locations of the 
physical pages in memory) for the compiled code to execute. User 
programs use virtual addresses only. 

The kernel and the hardware coordinate a mapping of these virtual and
physical addresses for the CPU, called "address translation," to locate
the process in memory.

A PA-RISC virtual address consists of a space identifier (SID) and an
offset.

• Each space ID represents a 4 GB unit of virtual memory.

• The offset portion of a virtual address is the offset into this space.

Table 1-1 Format of a 48-bit virtual address

Space ID     Offset
(16 bits)    (32 bits)

1. When virtual translation must be turned off (the D and I bits are
off), pages are accessed by their absolute addresses.


Every process running on a PA-RISC processor shares a 48-bit (or larger,
depending on HP-PA architecture version) global virtual address space 
with the kernel and with all other processes running on that machine. 
Although any process can create and attempt to read or write any virtual 
address, the kernel uses page granularity access control mechanisms to 
prevent unwanted interference between processes. 
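
As an illustration of the address format above, the following is a minimal sketch in C, assuming a 16-bit space ID and a 32-bit offset as in PA-RISC 1.x. The type and function names (vaddr48_t, vaddr_pack, vaddr_unpack) are hypothetical and are not taken from any HP-UX header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation of a PA-RISC 1.x virtual address:
 * a 16-bit space ID plus a 32-bit offset within that space. */
typedef struct {
    uint16_t space;   /* space identifier (SID) */
    uint32_t offset;  /* byte offset into the 4 GB space */
} vaddr48_t;

/* Pack the two parts into one 48-bit global virtual address,
 * held in the low 48 bits of a 64-bit integer. */
static uint64_t vaddr_pack(vaddr48_t va)
{
    return ((uint64_t)va.space << 32) | va.offset;
}

/* Split a 48-bit global virtual address back into space ID and offset. */
static vaddr48_t vaddr_unpack(uint64_t global)
{
    vaddr48_t va;
    assert((global >> 48) == 0);  /* only 48 bits are meaningful here */
    va.space  = (uint16_t)(global >> 32);
    va.offset = (uint32_t)(global & 0xFFFFFFFFu);
    return va;
}
```

Because the space ID occupies the upper bits, two processes that use the same offset in different spaces produce different global addresses.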

When a virtual page is "paged" into physical memory, free physical pages 
are allocated to it from the free list. These pages may be randomly 
scattered throughout the memory depending on their usage history. 
Translations are needed to tell the processor where the virtual pages are
loaded. The process of translating a virtual address into a physical
address is called virtual address translation.

Potentially, the virtual address space can be much greater than the
physical address space. The virtual memory system enables the CPU to
execute programs much larger than the available physical memory and
allows you to run many more programs at a time than you could without a
virtual memory system.

Demand Paging 

For a process to execute, all the structures for data, text, and so on
have to be set up. However, pages are not loaded in memory until they
are "demanded" by a process; hence the term, demand paging. Demand
paging allows the various parts of a process to be brought into physical
memory as the process needs them to execute. Only the working set of
the process, not the entire process, need be in memory at one time. A
translation is not established until the actual page is accessed.
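
The idea that a translation is established only on first access can be sketched in a few lines of C. This is a toy model under stated assumptions, not kernel code: the address space has a hypothetical size of eight pages, and "faulting in" a page is reduced to setting a flag and counting the fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8   /* hypothetical size of a tiny process address space */

/* One flag per virtual page: has a translation been established? */
typedef struct {
    bool   resident[NPAGES];  /* true once the page has been faulted in */
    size_t faults;            /* number of pages brought in on demand */
} address_space_t;

/* Touch a virtual page. A page that is not yet resident is "faulted in"
 * at the moment of first access; later accesses find it resident. */
static void touch_page(address_space_t *as, size_t vpage)
{
    assert(vpage < NPAGES);
    if (!as->resident[vpage]) {
        as->resident[vpage] = true;  /* stand-in for reading the page in */
        as->faults++;
    }
}
```

Pages that are never touched never cost a fault, which is why only the working set needs to be in memory.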

THE ROLE OF PHYSICAL MEMORY

Memory is the "container" for data storage; the general repository for
high-speed data storage close to the CPU is termed random access
memory (RAM) or "main memory." For the CPU to execute a process,
the code and data referenced by that process must reside in RAM.
During process execution, data is also held in two even faster
implementations of memory, registers and cache, found on the
processor. RAM is shared by all processes.

The more main memory in the system, the more data the system can
access and the more (or larger) processes it can retain and execute
without having to page or cause deactivation as frequently.
Memory-resident resources (such as page tables) also take up space in
main memory, reducing the space available to applications.

At boot time, the system loads HP-UX from disk into RAM, where it
remains memory-resident until the system is shut down.

User programs and commands are also loaded from disk into RAM. When
a program terminates, the operating system frees the memory used by
the process.

Disk access is slow compared to RAM access. Excessive disk access can
increase latency or reduce throughput, and can make the disk the
bottleneck in the system. Buffering avoids this. Buffering, paging, and
deactivation algorithms optimize disk access and determine when data
and code for currently running programs are returned from RAM to disk.
When a user or system program writes data to disk, the data is either
written directly to disk (raw I/O) or buffered in what is called the
buffer cache and written to disk in relatively big chunks. Programs
also read files and database structures from disk into RAM. When you
issue the sync command before shutting down a system, all modified
buffers of the buffer cache are flushed (written) out to disk.

Figure 1-1 Physical memory available to processes

[Figure labels: physical memory, HP-UX kernel at bootup, available
memory, lockable memory]


Available Memory 

The amount of main memory not reserved for the kernel is termed
available memory. Available memory is used by the system for executing
user processes.

Not all physical memory is available to user processes. Kernel text and
initialized data occupy about 8 MB of RAM.

Instead of allocating all its data structures at system initialization, the
HP-UX kernel dynamically allocates and releases some kernel structures
as needed by the system during normal operation. This allocation comes
from the available memory pool; thus, at any given time, part of the
available memory is used by the kernel and the remainder is available
for user programs.

Physical address space is the entire range of addresses used by hardware
(4 GB), and is divided into memory address space, processor-dependent
code (PDC) address space, and I/O address space. The next figure shows
the expanse of memory available for computation. Memory address
space takes up most of the system address space, while the address space
allotted to PDC and I/O consumes a relatively small range of addresses.

Figure 1-2 Major sections of system address space.

Lockable Memory

Pages kept in memory for the lifetime of a process by means of a system
call (such as mlock, plock, or shmctl) are termed lockable memory.
Locked memory cannot be paged, and processes with locked memory
cannot be deactivated. Typically, locked memory holds frequently
accessed programs or data structures, such as critical sections of
application code. Keeping them memory-resident improves application
performance.
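
A minimal sketch of locking a buffer with mlock follows. The wrapper names lock_region and unlock_region are hypothetical; mlock and munlock are the standard calls declared in <sys/mman.h>. Whether the call succeeds depends on privileges and on how much memory the system allows to be locked, so the sketch reports failure rather than assuming success. Some systems may also require the address to be page aligned; that is not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Try to lock len bytes starting at buf into physical memory.
 * Returns 0 on success, -1 if the system refused (for example because
 * of insufficient privilege or too little lockable memory). */
static int lock_region(void *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return mlock(buf, len) == 0 ? 0 : -1;
}

/* Release a lock taken with lock_region(). */
static int unlock_region(void *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return munlock(buf, len) == 0 ? 0 : -1;
}
```

A caller would typically lock only a small, frequently used region and unlock it as soon as residency is no longer required.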

The lockable_mem variable tracks how much memory can be locked.

Available memory is the portion of physical memory that remains after
subtracting the space required for the kernel and its data structures.
The initial value of lockable_mem is the available memory on the system
after boot-up, minus the value of the system parameter, unlockable_mem.

The value of lockable memory depends on several factors: 

• The size of the kernel varies, depending on the number of interface 
cards, users, and values of the tunable parameters. 

• Available memory varies from system to system. 

• The system parameter unlockable_mem is a kernel tunable
parameter. Changing the value of unlockable_mem also alters the
default value of lockable_mem.

HP-UX places no explicit limits on the amount of available memory you
may lock down; instead, HP-UX restricts how much memory cannot be
locked.

Other kernel resources that use memory (such as the dynamic buffer
cache) can change the amount of memory that can be locked:

• As memory is used, the amount of memory that can be locked 
decreases. 

• As memory is freed up, the amount of memory that can be locked 
increases. 

As the amount of memory that has been locked down increases, existing
processes compete for a smaller and smaller pool of usable memory. If
the number of pages in this remaining pool of memory falls below the
paging threshold called lotsfree, the system activates its paging
mechanism by scheduling vhand, in an attempt to keep a reasonable
amount of memory free for general system use.
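
The threshold test described above can be sketched as follows. This is an illustrative model only: the structure mem_state_t and the functions should_run_vhand and lock_pages are hypothetical, and the real kernel bookkeeping is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the counters discussed above, in pages. */
typedef struct {
    long free_pages;  /* pages left in the pool processes compete for */
    long lotsfree;    /* paging threshold */
} mem_state_t;

/* Decide whether the pager (vhand) should be scheduled:
 * paging starts once the remaining pool drops below lotsfree. */
static bool should_run_vhand(const mem_state_t *m)
{
    assert(m->lotsfree >= 0);
    return m->free_pages < m->lotsfree;
}

/* Locking pages removes them from the pool that processes compete for. */
static void lock_pages(mem_state_t *m, long npages)
{
    assert(npages >= 0 && npages <= m->free_pages);
    m->free_pages -= npages;
}
```

With 1000 free pages and a threshold of 256, locking 800 pages leaves 200 and tips the system into paging.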

Care must be taken to allow sufficient space for processes to make 
forward progress; otherwise, the system is forced into paging and 
deactivating processes constantly, to keep a reasonable amount of 
memory free. 

Secondary Storage 

Data is moved to secondary storage if the system is short of main
memory. The data is typically stored on disks accessible via either
system buses or the network, making room in memory for active
processes.

Swap refers to a physical memory management strategy (predating
UNIX) where entire processes are moved between main memory and
secondary storage. Modern virtual memory systems no longer
swap entire processes, but rather use a deactivation scheme that allows
pages to be pushed out over time by a paging mechanism. While
a program executes, pages of data and instructions can be paged out to
or paged in from secondary storage if the system load warrants such
behavior.

Device swap can take the form of an entire disk or an LVM logical volume
of a disk (see footnote 1). A file system can be configured to offer free
space for swap; this is termed file-system swap. If more swap space is
required, it can be added dynamically to a running system, as either
device swap or file-system swap. The swapon command is used to allocate
disk space or a directory in a file system for swap.


1. Logical Volume Manager (LVM) is a set of commands and underlying
software to handle disk storage resources with more flexibility
than offered by traditional disk partitions.

THE ABSTRACTION OF VIRTUAL
MEMORY

A computer has a finite amount of RAM available, but each HP-UX
process has a 4 GB virtual address space apportioned in four
one-gigabyte quadrants, termed virtual memory.

Virtual memory is the software construct that allows each process
sufficient computational space in which to execute. It is accomplished
with hardware support.

Virtual Space in PA-RISC 

As software is compiled and run, it generates virtual addresses that 
provide programmers with memory space many times larger than 
physical memory alone. The number of bits available for the space
determines the ultimate size of the virtual address space. At PA-RISC 
1.x, the operating system has 32-bit physical addressing and 48-bit 
virtual addressing (the latter consisting of 16-bit space and 32-bit offset 
to allow for 4 GB per space); the total virtual address range is 

(2^16) * 4 GB = 262,144 GB

By comparison, Level 2 has a far greater total virtual address range of

(2^32) * 4 GB = 17,179,869,184 GB
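
The two totals can be checked with a short calculation. The helper below is a hypothetical illustration that assumes a fixed 4 GB (32-bit offset) per space and takes the number of space ID bits as its only parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Total virtual address range in GB, given the number of space ID bits
 * and a fixed 4 GB (32-bit offset) per space. */
static uint64_t total_virtual_gb(unsigned space_bits)
{
    assert(space_bits < 62);  /* keep the result within 64 bits */
    return ((uint64_t)1 << space_bits) * 4u;
}
```

Calling total_virtual_gb(16) gives the 262,144 GB figure for a 16-bit space ID, and total_virtual_gb(32) gives 17,179,869,184 GB for a 32-bit space ID.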

NOTE Understand, however, that a single process has significant
limitations on the virtual address space it is allowed to access. For
example, share_magic executable text is limited to 1 GB and data is
limited to 1 GB. The total amount of shared virtual address space in the
system is limited to 1.75 GB.


Physical Addresses 

A physical address points to a page in memory that represents 4096
bytes of data. The physical address also contains an offset into this page.
Thus, the complete physical address is composed of a physical page
number (PPN) and page offset. The PPN is the 20 most significant bits of
the physical address where the page is located. These bits are
concatenated with a 12-bit page offset to form the 32-bit physical
address.

Figure 1-3 Bit layout of physical address

Page Number             Page Offset
00000000000000000100    100001110011
0                  19   20        31

To handle the translation of the virtual address to a physical address,
the virtual address also needs to be looked at as a virtual page number
(VPN) and page offset. Since the page size is 4096 bytes, the low-order
12 bits of the offset are assumed to be the offset into the page. The
space ID and the high-order 20 bits of the offset are the VPN.

For any given address you can determine the page number by discarding 
the least significant 12 bits. What remains is the virtual page number 
for a virtual address or the physical page number for the physical 
address. 
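
The split into page number and page offset is a shift and a mask. The sketch below assumes 4096-byte pages and a 32-bit address or offset; the macro and function names are illustrative. For a virtual address, the full VPN also includes the space ID, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                       /* 4096-byte pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)  /* low 12 bits: 0xFFF */

/* Discard the low 12 bits to obtain the page number. */
static uint32_t page_number(uint32_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Keep only the low 12 bits to obtain the offset into the page. */
static uint32_t page_offset(uint32_t addr)
{
    return addr & PAGE_MASK;
}

/* Concatenate a 20-bit page number and a 12-bit offset. */
static uint32_t make_address(uint32_t pn, uint32_t off)
{
    assert(pn < (1u << 20) && off <= PAGE_MASK);
    return (pn << PAGE_SHIFT) | off;
}
```

For the address 0x4873, page_number returns 0x4 and page_offset returns 0x873.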

The next figure shows the bit layout of a virtual address of 0x0.4873. 

Figure 1-4 Bit layout of virtual page address

16-bit Space ID     32-bit Offset
0000000000000000    00000000000000000100    100001110011
                    VPN = 0x4               Page Offset 0x873


The virtual page number's address must be translated to obtain the 
associated physical page number, with page offset 0x873. 

MEMORY-RELEVANT PORTIONS OF
THE PROCESSOR


Figure 1-5 Processor architecture, showing major components 



The figure above and the table that follows name the principal processor
components; of them, registers, translation lookaside buffer, and cache
are crucial to memory management and are discussed in greater
detail following the table.

Processor Architecture, components and purposes

Component                Purpose

Central Processing       The main component responsible for reading
Unit (CPU)               program and data from memory, and
                         executing the program instructions. Within
                         the CPU are the following:

                         • Registers, high-speed memory used to hold
                         data while it is being manipulated by
                         instructions, for computations,
                         interruption processing, protection
                         mechanisms, and virtual memory
                         management. Registers are discussed
                         shortly in greater detail.

                         • Control Hardware (also called instruction
                         or fetch unit) that coordinates and
                         synchronizes the activity of the CPU by
                         interpreting (decoding) instructions to
                         generate control signals that activate the
                         appropriate CPU hardware.

                         • Execution Hardware to perform the actual
                         arithmetic, logic, and shift operations.
                         Execution Hardware can take on many
                         specialized tasks, but the most common are
                         the Arithmetic and Logic Unit (ALU) and
                         the Shift Merge Unit (SMU).

Instruction and Data     The cache is a portion of high-speed memory
Cache                    used by the CPU for quick access to data and
                         instructions. The most recently accessed data
                         is kept in the cache.

Translation Lookaside    The processor component that enables the
Buffer (TLB)             CPU to access data through virtual address
                         space by:

                         • Translating the virtual address to physical
                         address.

                         • Checking access rights, so that access is
                         granted to instructions, data, or I/O only if
                         the requesting process has proper
                         authorization.

Floating Point           An assist processor that carries out
Coprocessor              specialized tasks for the CPU.

System Interface         Bus circuitry that allows the CPU to
Unit (SIU)               communicate with the central (native) bus.

The Page Table or PDIR 

The operating system maintains a table in memory called the Page
Directory (PDIR), which keeps track of all pages currently in memory.
When a page is mapped in some virtual address space, it is allocated an
entry in the PDIR. The PDIR is what links a physical page in memory to
its virtual address.

The PDIR is implemented as a memory-resident table of software
structures called page directory entries (PDEs), which contain virtual
addresses. The PDIR maps the entire physical memory with one entry
for every page in physical memory. Each entry contains a 48- or 64-bit
virtual address. When the processor needs to find a physical page not
indexed in the TLB, it can search the PDIR with a virtual address until it
finds a matching address.
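
The search can be sketched as a lookup in a hash table with collision chains, which is how the PDIR is organized. The sketch below is a simplified model under stated assumptions: struct pde carries only a few fields, the table size NBUCKETS is hypothetical, and pdir_hash is an illustrative hash, not the function used by the hardware or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 16u  /* hypothetical table size; must be a power of two */

/* Simplified page directory entry; the real hpde has many more fields. */
struct pde {
    uint16_t    space;  /* space ID of the mapped virtual page */
    uint32_t    vpage;  /* virtual page number within the space */
    uint32_t    phys;   /* physical page number */
    struct pde *next;   /* next entry on the collision chain, or NULL */
};

/* Illustrative hash only: fold space and page number into a bucket. */
static unsigned pdir_hash(uint16_t space, uint32_t vpage)
{
    return (space ^ vpage) & (NBUCKETS - 1);
}

/* Add an entry at the head of its bucket's collision chain. */
static void pdir_insert(struct pde *table[], struct pde *e)
{
    assert(e != NULL);
    unsigned b = pdir_hash(e->space, e->vpage);
    e->next = table[b];
    table[b] = e;
}

/* Walk the chain for the bucket; return the matching entry,
 * or NULL when no translation exists (the page-fault case). */
static struct pde *pdir_lookup(struct pde *table[],
                               uint16_t space, uint32_t vpage)
{
    for (struct pde *e = table[pdir_hash(space, vpage)]; e != NULL; e = e->next)
        if (e->space == space && e->vpage == vpage)
            return e;
    return NULL;
}
```

Two entries that hash to the same bucket simply share a chain; the comparison of space and page number picks the right one.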

The PDIR table is a hash table with collision chains. The virtual address
is used to hash into one of the buckets in the hash table and the
corresponding chain is searched until a chain entry with a matching
virtual address is found.

Page Fault

A trap occurs when a translation is missing from the translation
lookaside buffer (TLB, discussed shortly). If the processor can find the
missing translation in the PDIR, it installs it in the TLB and allows
execution to continue. If not, a page fault occurs.

A page fault is a trap taken when the address needed by a process is
missing from main memory. This occurrence is also known as a PDIR
miss. A PDIR miss indicates that the page is either on the free list, in the
page cache, or on disk; the memory management system must then find
the requested page on the swap device or in the file system and bring it
into main memory.

Conversely, a PDIR hit indicates that a translation for the virtual
address exists in the PDIR and can be installed in the TLB.

The Hashed Page Directory (hpde) structure

Each PDE contains information on the virtual-to-physical address
translation, along with other information necessary for the management
of each page of virtual memory. The structural elements of the hashed
page directory for PA-RISC 1.1 are shown in the following table.

Table 1-3 struct hpde, the hashed page directory

Element              Meaning

pde_valid            Flag set by the kernel to indicate a valid pde
                     entry.

pde_vpage            Virtual page - high 15 bits of the virtual offset.

pde_space            Contains the complete 16-bit virtual space.

pde_ref              Reference bit set by the kernel when it receives
                     certain interrupts; used by vhand() to tell if a
                     page has been used recently.

pde_accessed         Used by the stingy cache flush algorithm to
                     indicate that the page may be in data cache. (a)

pde_rtrap            Data reference trap enable bit; when set, any
                     access to the page causes a page reference trap
                     interruption.

pde_dirty            Dirty bit; marked if the page differs in memory
                     from what is on disk.

pde_dbrk             Data break; used by the TLB.

pde_ar               Access rights; used by the TLB. (b)

pde_protid           Protection ID; used by the TLB.

pde_executed         Used by the stingy cache flush algorithm to
                     indicate that the page is referenced as text.

pde_uip              Lock flag used by trap-handling code.

pde_phys             Physical page number; the physical memory
                     address divided by the page size.

pde_modified         Indicator to the high-level virtual memory
                     routines as to whether the page has been modified
                     since last written to a swap device.

pde_ref_trickle      Trickle-up bit for references. Used with pde_ref
                     on systems whose hardware can search the htbl
                     directly.

pde_block_mapped     Block mapping flag; indicates page is mapped by
                     block TLB and cannot be aliased.

pde_alias            Virtual alias field. If set, the pde has been
                     allocated from elsewhere in kernel memory,
                     rather than as a member of the sparse PDIR.

pde_next             Pointer to next entry, or null if end of list.


a. Stingy cache flush is a performance enhancement by which
the kernel recognizes whether or not to flush the cache.

b. For detailed information on access rights, see the PA-RISC 2.0
Architecture reference, chapter 3, "Addressing and Access
Control." For information about how programs can manipulate
this field, see the mmap(2) and mprotect(2) manpages.

A word-oriented hpde structure (struct whpde) is implemented for
faster manipulation and is documented in
/usr/include/machine/pde.h. The pde.h header file also contains
the definitions for space manipulation, the maximum number of entries in
the PDIR hash table, constants related to field positions within the PDE
structure, access rights (which are now given on a region basis), and
another hashed page directory (struct hpde2_0) for PA-RISC 2.0.


NOTE The 2.0 version of the hpde structure has a field named var_page that 

can hold the page size information. This is used in implementing 
super-pages (>4K) on systems based on the PA 2.0 processor. 


Translation Lookaside Buffer (TLB) 

The translation lookaside buffer (TLB) translates virtual addresses to 
physical addresses. 

Figure 1-6 Role of the TLB. 



Address translation is handled from the top of the memory hierarchy,
hitting the fastest components first (such as the TLB on the processor)
and then moving on to the page directory table (pdir in main memory)
and lastly to secondary storage.

Organization and Types of TLB 

Depending on model, the TLB may be organized on the processor in one 
of two ways: 

• Unified TLB - A single TLB that holds translations for both data and
instructions.

• Split Data and Instruction TLB - Dual TLB units in the processor,
each of which holds translations specifically for data or instructions.

At one time many systems were being designed with split Data TLB
(DTLB) and Instruction TLB (ITLB), to account for the different
characteristics of data and instruction locality and type of access
(frequent random access of data versus relatively sequential single usage
of instructions). Cost factors have allowed the inclusion of much larger
TLBs on processors, which has lessened the disadvantages of a unified
TLB. As a result, many newer processors have unified TLBs.

Block TLB 

In addition to the standard TLB that maps each entry to a single page of
memory, many processors also have a block TLB. The block TLB is used
to map entries to virtual address ranges larger than a single page, that
is, multiple hpdes. Block TLB entries are used to reference kernel
memory that remains resident. Since the operating system moves data
in and out of memory by pages, a range of pages referenced by a block
TLB entry is locked in memory and cannot be paged out. Addressing
blocks of pages thus increases the overall address range of the TLB and
the speed with which large transactions can be serviced, and thus may
be thought of as a hardware implementation of large pages. The block
TLB is typically used for graphics, because graphics data is accessed in
huge chunks. It is also used for mapping other static areas such as
kernel text and data.

Figure 1-7 The TLB is a cache for address translations

The TLB translates addresses

The TLB looks up the translation for the virtual page numbers (VPNs)
and gets the physical page numbers (PPNs) used to reference physical
memory.

Ideally the TLB would be large enough to hold translations for every
page of physical memory; however, this is prohibitively expensive.
Instead, the TLB holds a subset of entries from the page directory table
(PDIR) in memory. The TLB speeds up the process of examining the
PDIR by caching copies of its most recently utilized translations.

Because the purpose of the TLB is to satisfy virtual-to-physical address
translation, the TLB is only searched when memory is accessed while in
virtual mode. This condition is indicated by the D-bit in the PSW (or the
I-bit for instruction access).
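
The TLB's role as a cache in front of the PDIR can be sketched as follows. This is a toy model under stated assumptions: the TLB has a hypothetical four entries, it is searched fully associatively, replacement is simple round-robin, and pdir_translate is a stand-in for the real PDIR search with a made-up mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 4u  /* hypothetical; real TLBs are much larger */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;   /* virtual page number (space ID plus page bits) */
    uint32_t ppn;   /* physical page number */
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
    unsigned next;          /* round-robin replacement cursor */
    unsigned hits, misses;  /* counters for illustration */
};

/* Stand-in for the PDIR search: a made-up fixed mapping. */
static uint32_t pdir_translate(uint64_t vpn)
{
    return (uint32_t)(vpn + 0x100);
}

/* Translate a VPN: search every TLB entry; on a miss, consult the
 * PDIR stand-in and install the translation in the TLB. */
static uint32_t tlb_translate(struct tlb *t, uint64_t vpn)
{
    assert(t != NULL);
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].vpn == vpn) {
            t->hits++;
            return t->e[i].ppn;
        }
    }
    t->misses++;
    uint32_t ppn = pdir_translate(vpn);
    t->e[t->next] = (struct tlb_entry){ true, vpn, ppn };
    t->next = (t->next + 1) % TLB_ENTRIES;
    return ppn;
}
```

A repeated access to the same page hits in the TLB; once more distinct pages have been touched than the TLB has entries, older translations are evicted and must be fetched from the PDIR again.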

TLB Entries 

Since the TLB translates virtual to physical addresses, each entry
contains both the Virtual Page Number (VPN) and the Physical Page 
Number (PPN). Entries also contain Access Rights, an Access Identifier, 
and five flags. 

TLB flags (PA 2.x architecture)

Flag     Meaning

O        Ordered. Accesses to data for load and store are ranked by
         strength: strongly ordered, ordered, and weakly ordered.
         (See PA-RISC 2.0 specifications for model and definitions.)

U        Uncacheable. Determines whether data references to a page
         from memory address space may be moved into the cache.
         Typically set to 1 for data references to a page that maps to
         the I/O address space or for memory address space that
         must not be moved into cache.

T (a)    Page Reference bit. If set, any access to this page causes a
         reference trap to be handled either by hardware or software
         trap handlers.

D        Dirty Bit. When set, this bit indicates that the associated
         page in memory differs from the same page on disk. The
         page must be flushed before being invalidated.

B        Break. This bit causes a trap on any instruction that is
         capable of writing to this page.

P        Prediction method for branching; optional, used for
         performance tuning.

a. The T, D, and B flags are only present in data or unified TLBs.


In PA 1.x architecture, an E bit (or "valid" bit) indicates that the TLB
entry reflects the current attributes of the physical page in memory.

Instruction and Data Cache 

Cache is fast, associative memory on the processor module that stores 
recently accessed instructions and data. From it, the processor learns 
whether it has immediate access to data or needs to go out to (slower) 
main memory for it. 

Cacheable data going to the CPU from main memory passes through the
cache. Conversely, the cache serves as the means by which the CPU
passes data to and from main memory. Cache reduces the time required
for the CPU to access data by maintaining a copy of the data and
instructions most recently requested.



A cache improves system performance because most memory accesses
are to addresses that are very close to or the same as previously accessed
addresses. The cache takes advantage of this property by bringing into
cache a block of data whenever the CPU requests an address. Although
the hit rate depends on the size of the cache, associativity, and
workload, performance measurements show that the vast majority of the
time the cache already holds the data the CPU requests next.

Cache Organization 

Depending on model, PA-RISC processors are equipped with either a 
unified cache or separate caches for instructions and data (for better 
locality and faster performance). In multiprocessing systems, each
processor has its own cache, and a cache controller maintains 
consistency. 

Cache memory itself is organized as follows: 

• A quantity of equal-sized blocks called cache lines, defined to be the 
same unit of size as data passed between cache and main memory. A 
cache line can be 16, 32, or 64 bytes long, aligned. 

• One 15-bit long cache tag for every cache line, to describe its contents 
and determine if the desired data is present. The tag contains 

• Physical Page Number (PPN), identifying the page in main 
memory where the data resides. 

• Flag bits. When set, a valid flag indicates the cache line contains
valid data. A dirty bit is set if the CPU has modified contents of
the cache line; that is, the cache (not main memory) contains the
most current data. If the dirty bit is not set, the line is said to be
"clean," meaning that the cache line does not have modified
contents. Other implementation-specific flags may be present.

• Both the cache tag and cache line have associated parity bits used for 
checksumming, to make sure the line is correct. 

Figure 1-8 Every cache entry consists of a cache tag and cache line.

How the CPU Uses Cache And TLB

When a process executes, it stores its code (text) and data in processor
registers for referencing. If the data or code is not present in the
registers, the CPU supplies the virtual address of the desired data to the
TLB and to the cache controller. Depending on implementation, caches
can be direct mapped, set associative, or fully associative. Recent PA
implementations use direct-mapped caches and fully associative
TLBs. Virtual addresses can be sent in parallel to the TLB and cache
because the cache is virtually indexed.

A physical page may not be referenced by more than one virtual page, 
and a virtual address cannot translate to two different physical 
addresses; that is, PA-RISC does not support hardware address aliasing, 
although HP-UX implements software address aliasing for text only in 
EXEC_MAGIC executables. 

The cache controller uses the low-order bits of the virtual address to 
index into the direct-mapped cache. Each index in the cache finds a cache 
tag containing a physical page number (PPN) and a cache line of data. If 
the cache controller finds an entry at the cache location, the cache line is 
checked to see whether it is the right one by comparing the PPN in the 
cache tag with the one returned by the TLB, because blocks from many 
different locations in main memory can be mapped legitimately to a 
given cache location. If the data is not in cache but the page is 
translated, the resultant data cache miss is handled completely by the 
hardware. A TLB miss occurs if the page is not translated in the TLB; if 
the translation is also not in the PDIR, HP-UX uses the page fault code 
to fault it in. If the page is not in RAM, the data and code might have to 
be paged in from disk, in which case the disk-to-memory transaction must 
be performed. 

Figure 1-9 PPNs from Cache and TLB are compared 

On a more detailed level, the next figure demonstrates the mapping of 
virtual and physical address components. 

Figure 1-10 Virtual address translation 
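The direct-mapped lookup described in this section can be sketched in C. This is a minimal model, not PA-RISC hardware behavior: the cache geometry (LINE_BYTES, NUM_LINES), the struct layout, and the function names are assumptions made for illustration. It shows only the two steps named in the text: index with low-order virtual address bits, then compare the PPN in the cache tag with the PPN returned by the TLB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry, chosen only for illustration. */
#define LINE_BYTES 32u      /* one of the line sizes mentioned in the text */
#define NUM_LINES  1024u    /* assumed number of cache entries */

struct cache_entry {
    bool     valid;         /* valid flag from the cache tag */
    bool     dirty;         /* dirty flag from the cache tag */
    uint32_t ppn;           /* physical page number from the cache tag */
};

struct cache_entry cache[NUM_LINES];

/* Low-order virtual address bits select the entry (virtually indexed). */
uint32_t cache_index(uint32_t vaddr)
{
    return (vaddr / LINE_BYTES) % NUM_LINES;
}

/* A hit requires a valid entry whose tag PPN equals the TLB's PPN. */
bool cache_hit(uint32_t vaddr, uint32_t tlb_ppn)
{
    const struct cache_entry *e = &cache[cache_index(vaddr)];
    return e->valid && e->ppn == tlb_ppn;
}

/* Record that the line selected by vaddr now holds data from page ppn. */
void cache_fill(uint32_t vaddr, uint32_t ppn)
{
    struct cache_entry *e = &cache[cache_index(vaddr)];
    e->valid = true;
    e->dirty = false;
    e->ppn = ppn;
}
```

Two virtual addresses that differ by LINE_BYTES * NUM_LINES select the same entry, which is why the PPN comparison is needed before the line can be trusted.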



TLB Hits and Misses 

The sequence followed by the processor as it validates addresses is one of 
"hit or miss." 

• The TLB is searched; that is, each virtual address and byte offset 
issued by the processor indexes an entry in the TLB. 

  • If the entry is valid, it is known as a TLB hit. The TLB contains a 
  valid physical page number (PPN), which might be accessed in 
  cache. 

  • If the entry is invalid or the TLB cannot provide a physical page 
  number, a TLB miss occurs and must be handled. On certain 
  systems, a hardware walker searches the PDIR and, if it finds the 
  page, updates the TLB. On systems not equipped with a hardware 
  TLB handler, or if the hardware walker does not find an entry in 
  the PDIR, a software interrupt is generated. The software 
  interrupt handler resolves the fault and updates the TLB, allowing 
  the access to proceed. 

There are five TLB miss handlers (instruction, data, non-access 
instruction, non-access data, and dirty) located in locore.s; the header 
file pde.h has the TLB/PDIR structure definition. 
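The hit-or-miss sequence can be modeled in a few lines of C. This is a sketch under stated assumptions, not the locore.s handlers: the table sizes, struct xlate, and all function names are hypothetical, and the TLB and PDIR are reduced to small linear arrays. The hardware walker and the software handler are modeled as searching the same PDIR array, and a translation missing from the PDIR is reported as a page fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum tlb_result {
    TLB_HIT,
    TLB_FILLED_BY_WALKER,    /* hardware walker found the page in the PDIR */
    TLB_FILLED_BY_SOFTWARE,  /* software miss handler found it in the PDIR */
    TLB_PAGE_FAULT           /* not in the PDIR: page fault code takes over */
};

#define TLB_ENTRIES  4u      /* assumed sizes, for illustration only */
#define PDIR_ENTRIES 8u

struct xlate { bool valid; uint32_t vpn; uint32_t ppn; };

struct xlate tlb[TLB_ENTRIES];
struct xlate pdir[PDIR_ENTRIES];

/* Linear search stands in for the real associative or hashed lookup. */
const struct xlate *search(const struct xlate *tab, unsigned n, uint32_t vpn)
{
    for (unsigned i = 0; i < n; i++)
        if (tab[i].valid && tab[i].vpn == vpn)
            return &tab[i];
    return NULL;
}

/* Round-robin replacement; the real replacement policy is not modeled. */
void tlb_insert(const struct xlate *x)
{
    static unsigned next;
    tlb[next] = *x;
    next = (next + 1) % TLB_ENTRIES;
}

enum tlb_result translate(uint32_t vpn, bool has_hw_walker, uint32_t *ppn)
{
    const struct xlate *x = search(tlb, TLB_ENTRIES, vpn);
    if (x != NULL) {
        *ppn = x->ppn;
        return TLB_HIT;
    }
    x = search(pdir, PDIR_ENTRIES, vpn);
    if (x == NULL)
        return TLB_PAGE_FAULT;
    tlb_insert(x);           /* update the TLB so the access can proceed */
    *ppn = x->ppn;
    return has_hw_walker ? TLB_FILLED_BY_WALKER : TLB_FILLED_BY_SOFTWARE;
}
```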

TLB Role in Access Control and Page 
Protection 

In addition to assisting in virtual address translation, the translation 
lookaside buffer (TLB) serves a security function on behalf of the 
processor, by controlling access and ensuring that a user process sees 
only data for which it has privilege rights. 

The TLB contains access rights and protection identifiers. PA-RISC 
allows up to four protection IDs to be associated with each process. 
These IDs are held in control registers CR-8, CR-9, CR-12, and CR-13. 

Table 1-5 Security checks in the TLB 

Security check       Purpose
-------------------  ----------------------------------------------------
Protection Check     The P-bit (Protection ID Validation Enable bit) of
                     the Processor Status Word (PSW) is checked:

                     • If not set, protection checking on the page is
                       waived, as though passed, and checking proceeds
                       to access rights validation.

                     • If the protection ID validation bit is set, the
                       access ID of the TLB entry is compared to the
                       protection IDs in CR-8, CR-9, CR-12, and CR-13.

Access Rights Check  Access Rights are stored in a seven-bit field
                     containing permissible access type and two
                     privilege levels affecting the executing
                     instruction:

                     • Access types are read, write, execute.

                     • Privilege levels checked for read access and
                       write access, kernel and user execution.

Figure 1-11 shows the checkpoints for controlling access to a page of data 
through the TLB. Two checks are performed: protection check and 
access rights check. If both checks pass, access is granted to the page 
referenced by the TLB. 

Figure 1-11 Access control to virtual pages 



Cache Hits and Misses 

When the cache line was first copied into the cache, its Physical Page 
Number was stored in the corresponding cache tag. The cache 
controller compares the PPN from the tag to the PPN supplied by the 
TLB. 

• If the PPN in the cache tag matches the PPN from the TLB, a 
cache hit occurs. The data is present in the cache and is supplied 
to the CPU. 

• If the PPN in the cache tag does not match the PPN from the TLB, 
a cache miss occurs. The data is absent from the cache, and the CPU 
must wait while the cache line is loaded from main memory, because 
the bytes referenced on the virtual page are not yet in cache. 
(Typically, our implementations do not load an entire page at a time 
into the cache; they load a cache line at a time.) 

The time it takes to service a cache miss (assuming a TLB hit) depends 
on whether the line already present at that cache location is clean or 
dirty. If the line is dirty, the old contents are written out to memory 
and the new contents are then read in from memory. If the line is 
"clean" (that is, not modified), it does not have to be written back to 
main memory, and the penalty is fewer instruction cycles. 

All PA-RISC machines use a cache write-back policy, meaning that 
main memory is updated only when the cache line is replaced. 
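The difference between replacing a clean line and a dirty line under a write-back policy can be illustrated with a small C model. Everything here is an assumption made for the sketch: the sizes, the `main_memory` array, and the function names. The return value counts memory transfers, standing in for the miss penalty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u   /* assumed line size */
#define MEM_LINES  8u    /* tiny stand-in for main memory */

uint8_t main_memory[MEM_LINES][LINE_BYTES];

struct cache_line {
    bool     valid, dirty;
    uint32_t mem_line;            /* which memory line is currently cached */
    uint8_t  data[LINE_BYTES];
};

/* Replace the cached line with new_mem_line. A dirty line is written back
 * first; a clean line is simply overwritten. Returns the number of memory
 * transfers performed (1 for clean, 2 for dirty). */
int replace_line(struct cache_line *cl, uint32_t new_mem_line)
{
    int transfers = 0;

    assert(new_mem_line < MEM_LINES);
    if (cl->valid && cl->dirty) {
        memcpy(main_memory[cl->mem_line], cl->data, LINE_BYTES);
        transfers++;
    }
    memcpy(cl->data, main_memory[new_mem_line], LINE_BYTES);
    transfers++;
    cl->valid = true;
    cl->dirty = false;
    cl->mem_line = new_mem_line;
    return transfers;
}

/* A CPU store touches only the cache and marks the line dirty. */
void write_byte(struct cache_line *cl, uint32_t offset, uint8_t value)
{
    cl->data[offset % LINE_BYTES] = value;
    cl->dirty = true;
}
```

In this model main memory sees the stored byte only when the dirty line is replaced, which is the write-back behavior described above.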


Figure 1-12 Summary of page retrieval from TLB, Cache, PDIR 

PA-RISC allows for privilege level promotion by using a GATEWAY 
instruction. This instruction performs an interspace branch to increase 
the privilege level. The most common example of this in HP-UX is a 
system call, which changes the privilege level from user to kernel. 

Registers 

Registers, high-speed memory in the processor’s CPU, are used by the 
software as storage elements that hold data for instruction control flow, 
computations, interruption processing, protection mechanisms, and 
virtual memory management. 

All computations are performed between registers or between a register 
and a constant (embedded in an instruction), which minimizes the need 
to access main memory or code. This register-intensive approach 
accelerates performance of a PA-RISC system. Register memory is much 
faster than conventional main memory but it is also much more 
expensive, and is therefore used for processor-specific purposes. 

Registers are classified as privileged or non-privileged, depending on the 
privilege level of the instruction being executed. 

Table 1-6 Types of Registers 

Type of Register        Purpose
----------------------  --------------------------------------------------
32 General Registers,   Used to hold immediate results or data that is
each 32 bits in size    accessed frequently, such as the passing of
(non-privileged)        parameters. Listed are those with uses specified
                        by PA-RISC or HP-UX.
                        • GR0 - Permanent Zero
                        • GR1 - ADDIL target address
                        • GR2 - Return pointer. Contains the instruction
                          offset of the instruction to which to return
                        • GR23 - Argument word 3 (arg3)
                        • GR24 - Argument word 2 (arg2)
                        • GR25 - Argument word 1 (arg1)
                        • GR26 - Argument word 0 (arg0)
                        • GR27 - Global data pointer (dp)
                        • GR28 - Return value
                        • GR29 - Return value (double)
                        • GR30 - Stack pointer (sp)

7 Shadow Registers      Store contents of GR1, 8, 9, 16, 17, 24, and 25
(privileged)            on interrupt, so that they can be restored on
                        return from interrupt. Numbered SHR0-SHR6.

8 Space Registers,      Hold the space IDs for the current running
holding 16, 24, or      process.
32-bit space ID         • SR0 - Instruction address space link register
(SR5-SR7 are              used for branch and link external instructions.
privileged)             • SR1-SR7 - Used to form virtual addresses for
                          processes.

32 Control Registers,   Used to reflect different states of the system,
each 32 bits (most      many related primarily to interrupt handling.
are privileged)         • CR0 - Recovery Counter, used to provide
                          software recovery of hardware faults in
                          fault-tolerant systems and for debugging.
                        • CR10 - Low-order bits are known as the
                          Coprocessor Configuration Register (CCR), 8
                          bits that indicate presence and usability of
                          coprocessors. Bits 0, 1 correspond to the
                          floating point coprocessor; bit 2, the
                          performance monitor coprocessor.
                        • CR14 - Interruption Vector Address (IVA)
                        • CR16 - Interval Timer. Two internal registers,
                          one counting at a rate between twice and half
                          the implementation-specific "peak instruction
                          rate", the other register containing a 32-bit
                          comparison value. Each processor in a
                          multi-processor system has its own Interval
                          Timer, but they need not be synchronized nor
                          clock at the same frequency.
                        • CR17 - Stores the contents of the Instruction
                          Address Space Queue at time of interruption.
                        • CR19 - Used to pass an instruction to an
                          interrupt handler.
                        • CR20, CR21 - Used to pass a virtual address to
                          an instruction handler.
                        • CR26, CR27 - Temporary registers readable by
                          code executing at any privilege level but
                          writable only by privileged code.

64 Floating Point       Data registers used to hold computations.
Registers, 32 bits      • FP-0L - Status register. Controls arithmetic
each, or 32 registers,    modes, enables traps, indicates exceptions,
64 bits each              results of comparison, and identifies
                          coprocessor implementation.
                        • FP-0R through FP-3 - Exception registers,
                          containing information on floating point
                          operations whose execution has completed and
                          caused a delayed trap.

2 Instruction Address   Two queues 2 elements deep. The front elements of
Queues, each 64 bits    the queues (IASQ_Front and IAOQ_Front) form the
                        virtual address of the current instruction, while
                        the back elements (IASQ_Back and IAOQ_Back)
                        contain the address of the following instruction.
                        • Instruction Address Space Queue holds the space
                          ID of the current and following instruction.
                        • Instruction Address Offset Queue holds the
                          offset of the instruction for the given space.
                          High-order 62 bits contain the word offset of
                          the instruction; the 2 low-order bits maintain
                          the privilege level of the instruction.

1 Processor Status      Contains the current processor state. When an
Word (PSW), 32 bits     interruption occurs, the PSW is saved into the
(privileged)            Interrupt Processor Status Word (IPSW), to be
                        restored later. Low-order five bits of the PSW
                        are the system mask, and are defined as
                        mask/unmask or enable/disable. Interrupts
                        disabled by PSW bit are ignored by the processor;
                        interrupts masked remain pending until unmasked.

VIRTUAL MEMORY STRUCTURES 


Figure 1-13 Memory management structures 



Process management uses kernel structures down to the pregions to 
execute a process. The u_area/proc structure, vas, and pregion are 
per-process resources, because each process has its own unique copies of 
these structures, which are not shared among multiple processes. 

Below the pregion level are the systemwide resources. These 
structures can be shared among multiple processes (although they are 
not required to be shared). 

Memory management kernel structures map pregions to physical 
memory and provide support for the processor's ability to translate 
virtual addresses to physical memory. The table that follows introduces 
the structures involved in memory management; these are discussed 
later in detail. 

Table 1-7 Principal Memory Management Kernel Structures 

Kernel structure  Purpose
----------------  ------------------------------------------------------
vas               Keeps track of the structural elements associated with
                  a process in memory. One vas maintained per process.

pregion           A per-process resource that describes the regions
                  attached to the process.

region            A memory-resident system resource that can be shared
                  among processes. Points to the process's B-tree,
                  vnode, pregions.

B-tree            Balanced tree that stores pairs of page indices and
                  chunk addresses. At the root of a B-tree of vfds and
                  dbds is struct broot.

hpde              Contains information for virtual to physical
                  translation (that is, from vfd to physical memory).


Virtual Address Space (vas) 

The vas represents the virtual address space of a process and serves as 
the head of a doubly linked list of process region data structures called 
pregions. The vas data structure is always memory resident. 

When a process is invoked, the system allocates a vas structure and puts 
its address in p_vas, a field in the proc structure. 

The virtual address space of a process is broken down into logical chunks 
of virtually contiguous pages. (See the Process Management white paper 
for a table of vas entries.) 

Virtual memory elements of a pregion 

Each pregion represents a process's view of a particular portion of 
pages and information on getting to those pages. The pregion points to 
the region data structure that describes the pages' physical locations in 
memory or in secondary storage. The pregion also contains the virtual 
addresses to which the process's pages are mapped, the page usage (text, 
data, stack, and so forth), and page protections (read, write, execute, and 
so on). 

Figure 1-14 Virtual memory elements of the pregion 

The following elements of a per-process pregion structure are 
important to the virtual memory subsystem. 

Table 1-8 Principal elements of struct pregion 

Element           Purpose
----------------  ------------------------------------------------------
p_type            Type of pregion.

*p_reg            Pointer to the region attached by the pregion.

p_space,          Virtual address of the pregion, based on virtual
p_vaddr           space and virtual offset.

p_off             Offset into the region, specified in pages.

p_count           Number of pages mapped by the pregion.

p_ageremain,      Used in the vhand algorithm to age and steal pages
p_agescan,        of memory (discussed later).
p_stealscan,
p_bestnice

*p_vas            Pointer to the vas to which the pregion is linked.

p_forw,           The doubly-linked list, used by vhand to walk the
p_back            active pregions.

p_deactsleep      The address at which a deactivated process is
                  sleeping.

p_pagein          Size of an I/O, used for scheduling when moving data
                  into memory.

p_strength,       Used to track the ratio between sequential and
p_nextfault       random faults; used to adjust p_pagein.
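As an illustration of how p_space, p_vaddr, p_off, and p_count relate, the sketch below converts a virtual address inside a pregion to a page index within the attached region. It is an assumption-laden model, not the kernel's code: the struct is a hypothetical subset with guessed field types, a 4 KB page size is assumed, and the arithmetic (region page = p_off plus the page distance from p_vaddr) is inferred from the field descriptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES 4096u   /* assumed 4 KB page size */

/* Hypothetical subset of struct pregion; field types are assumptions. */
struct pregion_sketch {
    uint32_t p_space;   /* virtual space of the pregion */
    uint32_t p_vaddr;   /* virtual offset of the pregion's first page */
    uint32_t p_off;     /* offset into the region, in pages */
    uint32_t p_count;   /* number of pages mapped by the pregion */
};

/* Returns true and stores the region page index if (space, vaddr) falls
 * inside the pregion; returns false otherwise. */
bool pregion_page_index(const struct pregion_sketch *prp, uint32_t space,
                        uint32_t vaddr, uint32_t *region_page)
{
    uint32_t page;

    if (space != prp->p_space || vaddr < prp->p_vaddr)
        return false;
    page = (vaddr - prp->p_vaddr) / PAGE_BYTES;
    if (page >= prp->p_count)
        return false;
    *region_page = prp->p_off + page;
    return true;
}
```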


The Region, a system resource 

The region is a system-wide kernel data structure that associates groups 
of pages with a given process. Regions can be one of two types, private 
(used by a single process) or shared (able to be used by more than one 
process). Space for a region data structure is allocated as needed. The 
region structure is never written to a swap device, although its B-tree 
may be. 

Regions are pointed to by pregions, which are a per-process resource. 
Regions point to the vnode where the blocks of data reside when not in 
memory. 


Table 1-9 region (struct region) 

Element           Meaning
----------------  ------------------------------------------------------
r_flags           Region flags (enumerated shortly).

r_type            • RT_PRIVATE: Multiple processes cannot share region.
                    PT_DATA and PT_STACK pregions point to RT_PRIVATE
                    regions.
                  • RT_SHARED: Multiple processes can share region.
                    PT_SHMEM and most PT_TEXT pregions point to
                    RT_SHARED regions.

r_pgsz            Size of region in pages if all pages are in memory.

r_nvalid          Number of valid pages in region. This equals the
                  number of valid vfds in the B-tree or b_chunk.

r_dnvalid         Number of pages in swapped region. If the system
                  swaps the entire process, the value of r_nvalid is
                  copied here to later calculate how many pages the
                  process will need when it faults back in. This
                  information is used to decide which process to
                  reactivate.

r_swalloc         Total number of pages reserved and allocated for this
                  region on the swap device. Does not account for swap
                  space allocated for vfd/dbd pairs.

r_swapmem,        Memory reserved for pseudo-swap or vfd swap.
r_vfd_swapmem

r_lockmem         Number of pages currently allocated to the region for
                  lockable memory, including lockable memory allocated
                  for vfd/dbd pairs.

r_pswapf,         Forward and backward pointers to lists of pseudo-swap
r_pswapb          pages.

r_refcnt          Number of pregions pointing at the region.

r_zomb            Set to indicate modified text. If an executing a.out
                  file on a remote system has changed, the pages are
                  flushed from the processor's cache, causing the next
                  attempted access to fault. The fault handler finds
                  that r_zomb is non-zero, prints the message "pid %d
                  killed due to text modification or page I/O error"
                  and sends the process a SIGKILL.

r_off             Offset into the page-aligned vnode, specified in
                  pages; valid only if RF_UNALIGNED is not set. Page
                  r_off of the vnode is referenced by the first entry
                  of the first chunk of the region's B-tree.

r_incore          Number of pregions sharing the region whose
                  associated processes have the SLOAD flag set.

r_mlockcnt        Number of processes that have locked this region in
                  memory.

r_dbd             Disk block descriptor for B-tree pages written to a
                  swap device. Specifies the location of the first page
                  in the linked list of pages.

r_fstore,         Pointers to vnode of origin and destination of block.
r_bstore          This data depends on the type of pregion above the
                  region. In most cases, r_bstore is set to the paging
                  system vnode, the global swapdev_vp that is
                  initialized at system startup.

r_forw,           Pointers to linked list of all active pregions.
r_back

r_hchain          Hash for region.

r_lock            Region lock structure used to get read or read/write
                  locks to modify the region structure.

r_mlock           Wait for region to be locked in memory.

r_poip            Number of page I/Os in progress.

r_root            Root of B-tree; if referencing more than one chunk,
                  r_key is set to DONTUSE_IDX.

r_key, r_chunk    Used instead of B-tree search if referencing 32 or
                  fewer pages.

r_excproc         Pointer to the proc table entry, if the process has
                  RF_EXCLUSIVE set in r_flags.

r_hdl             Hardware-dependent layer structure.

r_next, r_prev    Circularly linked list of all regions sharing
                  pages/vnode.

r_pregs           List of pregions pointing to the region.

r_lchain          Linked list of memory lock ranges.

r_mlockswap       Swap reserved to cover locks.


a.out Support for Unaligned Pages 

Text and data of most executables start on a four-kilobyte page boundary. 
HP-UX can treat these as memory-mapped files, because a page in the 
file maps directly to a page in memory. 

In addition to the fields shown, struct region has fields to support 
executables compiled on older versions of HP-UX whose text and data do 
not align on a (4 KB) page boundary. These executables are referenced by 
regions whose r_flags is set to RF_UNALIGNED. 

Table 1-10 a.out support by regions 

Element            Meaning
-----------------  -----------------------------------------------------
r_byte, r_bytelen  Offset into the a.out file and length of its text.

r_hchain           Hash list of unaligned regions.


Region flags 

Various indicators of the state of the region are specified in r_flags. 

Table 1-11 Region flags 

Region flag          Meaning
-------------------  ---------------------------------------------------
RF_ALLOC             Always set because HP-UX regions are allocated and
                     freed on demand; there is no free list.

RF_MLOCKING          Indicator of whether a region is locked; set before
                     r_mlock, cleared after r_mlock is released.

RF_UNALIGNED         Set if text of an executable does not start on a
                     page boundary. In this case, the text is read
                     through the buffer cache to align it, and the vfds
                     are pointed at the buffer cache pages.

RF_WANTLOCK          Set if another stream wanted to lock this region,
                     but found it already locked and went to sleep.
                     After the region is unlocked, this flag ensures
                     that wakeup() is called so the waiting stream(s)
                     can proceed.

RF_HASHED            The text is unaligned (RF_UNALIGNED) and thus is on
                     a hash chain. The region is hashed with r_fstore
                     and r_byte; the head of each hash chain is in
                     texts[].

                     The RF_UNALIGNED flag may be set without the
                     RF_HASHED flag (if the system tries to get the
                     hashed region but it is locked, the system will
                     create a private one), but the RF_HASHED flag will
                     never be set without the RF_UNALIGNED flag.

RF_EVERSWP,          Set if the B-tree has ever been or is now written
RF_NOWSWP            to a swap device. These flags are used for
                     debugging.

RF_IOMAP             This region was created with an iomap() system
                     call, and thus requires special handling when
                     calling exit().

RF_LOCAL             Region is swapped locally.

RF_EXCLUSIVE         The mapping process is allowed exclusive access to
                     the region. This flag is set, and r_excproc is set
                     to the proc table pointer.

RF_SWLAZYWRT         If an a.out is marked EXEC_MAGIC, a lazy swap
                     algorithm is used, meaning swap is not reserved or
                     allocated until needed. The text file is not likely
                     to be modified, but if it is, a page of swap will
                     be reserved for it at that time.

RF_STATIC_PREDICT    Text object uses static branch prediction for
                     compiler optimization.

RF_ALL_MLOCKED       Entire region is memory locked, as a result of a
                     plock having been performed on the pregion
                     associated with the region.

RF_SWAPMEM           Region is using pseudo-swap; that is, a portion of
                     memory is being held for swap use.

RF_LOCKED_LARGE      Region is using large pages; used with super pages.

RF_SUPERPAGE_TEXT    Text region using large pages.

RF_FLIPPER_DISABLE   Disable kernel assist prediction; a flag used for
                     performance profiling.

RF_MPROTECTED        Some part of the region is subject to the system
                     call mprotect, which is performed on a
                     memory-mapped file.

pseudo-vas for Text and Shared Library pregions 

When a file is opened as an a.out or shared library, the easiest way to 
keep track of the region is to create a pseudo-vas the first time the file 
is opened as an executable. This is done by calling mapvnode() and 
storing the vas pointer in the vnode's v_vas element. On subsequent 
opens of the file as an executable, the non-NULL value in v_vas aids in 
finding the region to which the virtual address space is being attached. 

The pseudo-vas is type PT_MMAP, and the associated pregion has 
PF_PSEUDO set in p_flags. This pregion is attached to the region for 
this vnode. All the processes that use this executable or shared library 
(non-pseudo pregions) then attach to the region with type PT_TEXT 
(a.out) or PT_MMAP (shared library). The number of processes using a 
particular vnode as an executable is kept in the pseudo-vas in 
va_refcnt. 

All pregions associated with a region are connected with a 
doubly-linked list that begins with the region element r_pregs, and is 
defined in the pregions by p_prpnext and p_prpprev. The list is sorted 
by p_off, the pregion's offset into the region, and is NULL-terminated. 

Even after all processes using the a.out or shared library exit, the 
handle to the region remains; its pages can be disposed of at that time. 

Figure 1-15 Mapping the pseudo-vas structures 



Chunks - Keeping the vfds and dbds together in one place 

Since information is typically needed about groups of (rather than 
individual) pages, pages are grouped into chunks. A chunk contains 32 
pairs of virtual frame descriptors and disk block descriptors: 

• The kernel looks for a page in memory by its virtual frame descriptor 
(vfd). 

• The kernel looks for a page on disk by its disk block descriptor (dbd). 

• By definition, if the vfd's pg_v bit is set, the vfd is used; if not, the 
dbd is used. 

A one-to-one correspondence is maintained between vfd and dbd 
through the vfddbd structure, which simply contains one vfd (c_vfd) 
and one dbd (c_dbd). 


Figure 1-16 A chunk contains 32 vfddbd (256 bytes) 

HP-UX regions use chunks of vfds and dbds to keep track of page 
ownership: 

• For assignment from virtual page to physical page if the page is valid. 
(This is required in addition to the PDIR. The term "assignment" is 
used rather than "mapping" because the page might be valid but not 
translated.) 

• Other virtual attributes of the page (such as whether the page is 
locked in memory, or whether it is valid). 

• Location on disk for front-store and back-store pages. 
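A chunk's layout and the "vfd if valid, otherwise dbd" rule can be sketched in C. This is an illustrative model only: each descriptor is treated as an opaque 32-bit word, the bit position chosen for pg_v (PG_V_MASK) is a placeholder rather than the real layout, and the struct and function names are hypothetical. The static assertion reproduces the 256-byte chunk size: 32 pairs of two 4-byte words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_ENTRIES 32u

/* Placeholder bit for pg_v; the real position within the vfd is not
 * modeled here. */
#define PG_V_MASK 0x80000000u

struct vfddbd_sketch {
    uint32_t c_vfd;   /* virtual frame descriptor, treated as one word */
    uint32_t c_dbd;   /* disk block descriptor, treated as one word */
};

struct chunk_sketch {
    struct vfddbd_sketch pair[CHUNK_ENTRIES];
};

_Static_assert(sizeof(struct chunk_sketch) == 256,
               "32 vfd/dbd pairs of two 4-byte words each");

/* The page is looked up through its vfd only when pg_v is set. */
bool page_in_memory(const struct chunk_sketch *c, unsigned i)
{
    assert(i < CHUNK_ENTRIES);
    return (c->pair[i].c_vfd & PG_V_MASK) != 0;
}

/* Return whichever descriptor currently describes the page. */
uint32_t page_descriptor(const struct chunk_sketch *c, unsigned i)
{
    return page_in_memory(c, i) ? c->pair[i].c_vfd : c->pair[i].c_dbd;
}
```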

Virtual Frame Descriptors (vfd) 

A one-word structure called a virtual frame descriptor enables processes 
to reference pages of memory. The vfd is used when the process is in 
memory, and can be used to refer to the page of memory described in 
pfdat. 

Figure 1-17 Virtual frame descriptor (vfd) contents 
(Fields: flags, page frame number; bit positions marked 0, 11, and 31.) 

Table 1-12 Virtual Frame Descriptor (struct vfd) 

Element           Meaning
----------------  ------------------------------------------------------
pg_v              Valid flag. If set, this page of memory contains
                  valid data and pg_pfnum is valid. If not set, the
                  page's valid data is on a swap device.

pg_cw             Copy-on-write flag. If set, a write to the page
                  causes a data protection fault, at which time the
                  system copies the page.

pg_lock           Lock flag. If set, raw I/O is occurring on this page.
                  Either the data is being transferred between the page
                  and the disk, or data is being transferred between
                  two memory pages. The kernel sleeps waiting for
                  completion of I/O before launching further raw I/O to
                  or from this page. Nothing can read the page while it
                  is being written to disk.

pg_mlock          If set, the page is locked in memory and cannot be
                  paged out.

pg_pfnum          Page frame number, from which can be accessed the
(aliased as       correct pfdat entry for this page.
pg_pfn)

Disk Block Descriptor (dbd) 

When the pg_v bit in a vfd is not set, the vfd is invalid and the page of 
data is not in memory but on disk. In this case, the disk block descriptor 
(dbd) gives valid reference to the data. Like the vfd structure, the dbd 
is one word long. 

Figure 1-18 Contents of disk block descriptor (dbd) 
(Fields: type, data; bit positions marked 0, 3, and 31.) 

Table 1-13 Disk Block Descriptor (struct dbd) 

Element           Meaning
----------------  ------------------------------------------------------
dbd_type          One of six three-bit flags used to interpret dbd_data:
                  • DBD_NONE: No copy of this data exists on disk.
                  • DBD_FSTORE, DBD_BSTORE: Page can be found on a
                    "front or back store" device, pointed to by a
                    region's vnode. (a)
                  • DBD_DFILL: This is a demand-fill page. No space is
                    allocated; when a fault occurs it is initialized by
                    filling it with data from disk.
                  • DBD_DZERO: This is a demand zero page; when
                    requested, allocate a new page and initialize it
                    with zeroes.
                  • DBD_HOLE: Used for a sparse memory-mapped file;
                    when read, the page gives zeros. When written to, a
                    page is allocated, initialized to zero, data
                    inserted, at which time the dbd type changes to
                    DBD_NONE.

dbd_data          vnode type (nfs, ufs) specific data. A pointer points
                  to data in a file pointed to by a vnode.

a. When the dbd_type is DBD_FSTORE, it means that the page 
of data resides in the file pointed to by r_fstore (typically a 
file system). When the dbd_type is DBD_BSTORE, the page of 
data resides in the file or device file pointed to by r_bstore 
(typically a swap device). 

Balanced Trees (B-Trees) 

Each region contains either a single array of vfd/dbd (chunk) or a 
pointer to a B-tree. The structure called a B-tree allows for quick 
searches and efficient storage of sparse data. A bnode is the same size 
as a chunk; both can be allocated from the same source of memory. The 
region's B-tree stores pairs of page indices and chunk addresses. HP-UX 
uses an order 29 B-tree. 

A B-tree is searched with a key and yields a value. In the region 
B-tree, the key is the page number in the region divided by 32, the 
number of vfddbds in a chunk. 
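The key computation is simple integer arithmetic, sketched below. The key (page number divided by 32) comes from the text; using the remainder as the slot within the chunk is an assumption made for this example, as are the struct and function names.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_ENTRIES 32u   /* vfddbd pairs per chunk */

struct chunk_ref {
    uint32_t key;    /* B-tree key: selects the chunk */
    uint32_t slot;   /* assumed: index of the vfddbd pair within the chunk */
};

/* Split a page number within the region into a B-tree key and a slot. */
struct chunk_ref locate_page(uint32_t region_page)
{
    struct chunk_ref r;

    r.key  = region_page / CHUNK_ENTRIES;
    r.slot = region_page % CHUNK_ENTRIES;
    return r;
}
```

For example, page 70 of a region yields key 2 and slot 6, so pages 64 through 95 all share one chunk.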


Figure 1-19 A sample B-tree (order = 3, depth = 3) 



Each node of a B-tree contains room for order+1 keys (or index 
numbers) and order+2 values. If a node grows to contain more than 
order keys, it is split into two nodes; half of the pairs are kept in the 
original node and the other half are copied to the new node. The B-tree 
node data also includes the number of valid elements contained in that 
node. 

Table 1-14 B-tree Node Description (struct bnode) 

Element           Meaning
----------------  ------------------------------------------------------
b_key[B_SIZE]     The array of keys used for each page index of the
                  bnode.

b_nelem           Number of valid keys/values in the bnode.

b_down[B_SIZE+1]  The array of values in the bnode, either pointers to
                  another bnode (if this is an interior bnode) or
                  pointers to chunks (if this is a leaf bnode).

b_scr1, b_scr2    bnode padding to the size of a chunk, to allow
                  bnodes and chunks to be allocated from the same pool
                  of memory.
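The claim that a bnode is the same size as a chunk can be checked with a little arithmetic. The sketch below is a hypothetical layout, not the kernel's struct bnode: it assumes B_SIZE is order+1 = 30, that keys, counters, and padding are 4-byte integers, and that pointers are 32 bits wide (modeled as uint32_t). Under those assumptions the node occupies 30*4 + 4 + 31*4 + 2*4 = 256 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define B_ORDER 29                /* HP-UX uses an order 29 B-tree */
#define B_SIZE  (B_ORDER + 1)     /* assumed: room for order+1 keys */

/* Hypothetical layout; field order and types are assumptions. */
struct bnode_sketch {
    int32_t  b_key[B_SIZE];       /* keys (page indices) */
    int32_t  b_nelem;             /* number of valid keys/values */
    uint32_t b_down[B_SIZE + 1];  /* 32-bit pointers to bnodes or chunks */
    int32_t  b_scr1, b_scr2;      /* padding up to the size of a chunk */
};

_Static_assert(sizeof(struct bnode_sketch) == 256,
               "a bnode should be the same size as a 256-byte chunk");

/* A node holding more than 'order' keys has to be split. */
bool bnode_must_split(const struct bnode_sketch *bp)
{
    return bp->b_nelem > B_ORDER;
}
```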


Root of the B-tree 

A structure called struct broot points to the start of the B-tree. 

Table 1-15 Struct broot 

Element           Meaning
----------------  ------------------------------------------------------
b_root            Pointer to the initial point of the B-tree.

b_depth           Number of levels in the B-tree.

b_npages          Pages used to construct the B-tree, counting both
                  pages used for chunks and bnodes.

b_rpages          Number of real pages in the region; swap pages
                  reserved for the B-tree by the kernel, using the
                  routine vfdpgs(). Amount of swap allocated for the
                  vfd/dbd pairs in the B-tree structure.

b_list            Pointer to a linked list of memory pages from which
                  new bnodes or chunks can be added to the B-tree.

b_nfrag           Number of the next chunk available, derived from the
                  unused 256-byte fragments in b_list.

b_rp              Pointer to the region using the B-tree.

b_protoidx,       Stores page index of default dbd and prototype to
b_proto1,         minimize time and memory costs to allocate chunk
b_proto2          space.

b_vproto          List of page ranges whose bits are marked copy on
                  write.

b_key_cache,      Caches of most recently used keys and pointers to
b_val_cache       chunks associated with the keys; checked first when
                  querying the virtual memory subsystem.


vfd/dbd prototypes 

The struct vfdcw governs the vfd prototype. 

vfd prototype (struct vfdcw) 

Element             Meaning
------------------  ----------------------------------------------------
v_start[MAXVPROTO]  Page that indexes start of copy-on-write range; set
                    to -1 if unused.

v_end[MAXVPROTO]    End of copy-on-write range.


Hardware-1 ndependent Page I nformation 
table (pfdat) 

The hardware independent layer of the virtual memory subsystem 
manages pages in memory, pages written to swap devices, and the 
movement of pages from one to the other. The act of movi ng data from 
physical memory to a swap device, or moving data from a swap device to 
physical memory, is called paging. 

Basic to hardware independence is the page frame data table (pfdat), a 
big array indexed directly through the page number. Each page of 
available memory is represented by one pfdat structure; one pfdat 
entry represents each page frame writable to a swap device. HP-UX 
never pages kernel memory (the pages containing kernel text, stack, and 
data); thus, pfdat manages only the subset representing freely
available physical memory. When the pfdat is initialized, all free pages
are linked in a list pointed to by phead. 
Table 1-17


Table 1-18 


Principal entries in struct pfdat (page frame data) 


Element 

Meaning 

pf_hchain 

Hash chain link. 

pf_flags 

Page frame data flags (shown in the next 
table). 

pf_pfn 

Physical page frame number. 

pf_use 

Number of regions sharing the page; when 
pf_use drops to zero, the page can be placed 
on the free linked list. 

pf_devvp (see note a)

vnode for swap device. 

pf_data 

Disk block number on swap device. 

pf_next, pf_prev 

Next and previous free pfdat entries. 

pf_cache_waiting 

If set, this element means that a thread is
waiting to grab the pf_lock on that page.
Required for synchronization.

pf_lock 

Lock pfdat entry (beta semaphore), used to
lock the page while modifying the pde
(physical-to-virtual translation, access rights,
or protection ID).

pf_hdl 

Hardware dependent layer elements (see
hdl_pfdat discussion, shortly).


a. Hashing is done on the tuple (pf_devvp, pf_data).


Flags showing the Status of the Page 

Principal pf_flag values 


Flag 

Meaning 

P_QUEUE 

Page is on the free queue, headed by phead. 

P_BAD 

Page is marked as bad by the memory 
deallocation subsystem. 

P_HASH 

Page is on a hash queue; contains head of queue. 

P_ALLOCATING 

Page is being allocated; prevents another process 
from taking the page while it is being remapped. 

P_SYS 

Page is being used by the kernel rather than by a 
user process. Pages marked with this flag 
include dynamic buffer cache pages, B-tree 
pages and the results of kernel memory 
allocation. They are used by the kernel for critical 
data structures in addition to the kernel static 
pages that were not included in pfdat. 

P_DMEM 

Page is locked by the memory diagnostics
subsystem; set and cleared with an ioctl() call
to the dmem driver.

P_LCOW 

Page is being remapped by copy-on-write. 

P_UAREA 

Page is used by a pregion of type pt_uarea. 


Hardware-Dependent Layer page frame data entry 

If pf_hdl is referenced in struct pfdat, the struct hdlpfdat (defined
in hdl_pfdat.h) is used. pf_hdl is of type struct hdlpfdat.
Table 1-19 struct hdlpfdat


Element 

Meaning 

hdlpf_flags 

Flags that show the HDL status of the page: 

• hdlpf_trans: A virtual address translation 
exists for this page. 

• hdlpf_protect: Page is protected from user 
access. This flag indicates that the saved 
values are valid. 

• hdlpf_steal: Virtual translation should be 
removed when pending I/O is complete. 

• hdlpf_mod: Analogous to changing the
pde_modified flag in the pde.

• hdlpf_ref: Analogous to changing the
pde_ref flag in the pde.

• hdlpf_reada: Read-ahead page in transit;
used to indicate to the hdl_pfault()
routine that it should start the next I/O
request before waiting for the current I/O
request to complete.

hdlpf_savear 

Saved page access rights. 

hdlpf_saveprot 

Saved page protection ID.
MAPPING VIRTUAL TO PHYSICAL
MEMORY

The PA-RISC hardware attempts to convert a virtual address to a
physical address with the TLB or the block TLB. If it cannot resolve the
address, it generates a page fault (interrupt type 6 for an instruction
TLB miss fault; interrupt type 15 for a data TLB miss fault). The kernel
must then handle this fault.

PA-RISC uses a hashed page table (htbl) to pinpoint an address in the
enormous virtual address space. A ratio is kept of PDIR to hash table
entries, depending on the specific PA-RISC implementation (see cpu.h).
Control register 25 (CR25) contains the hash table address (see reg.h).

Figure 1-20 Contents of the htbl index (fields shown: Space, Offset)



The HTBL 

The algorithm for converting a virtual address to a physical address 
depends on the particular processor. 

Likewise, the algorithm for choosing the size of htbl has been developed 
empirically as the result of performance tests. 

• Each graphics driver estimates how many entries it will need for I/O 
mapping. This number is stored in niopdir. 

• The kernel first approximates nhtbl as the sum of niopdir and the
number of pages of RAM (physmem).

• nhtbl is now adjusted to a power of two and rounded appropriately.
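
The sizing steps above can be sketched as follows. This is a minimal illustration, not kernel code: the function name estimate_nhtbl is hypothetical, and rounding up to the next power of two is an assumption, since the text says only that the value is "adjusted to a power of two and rounded appropriately".

```python
def estimate_nhtbl(niopdir: int, physmem: int) -> int:
    """Approximate the hash table size from the two inputs named in the text.

    niopdir: I/O mapping entries estimated by the graphics drivers.
    physmem: number of pages of RAM.
    Rounding UP to a power of two is an assumption of this sketch.
    """
    estimate = niopdir + physmem
    if estimate < 1:
        raise ValueError("estimate must be positive")
    nhtbl = 1
    while nhtbl < estimate:
        nhtbl *= 2
    return nhtbl
```

A power-of-two size lets a hash value be reduced to a table index with a simple bit mask.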

Figure 1-21 Mapping from the htbl entry to the page directory entry 



The index generated by the hashing algorithm is now used as an index 
into htbl. Each entry in the table is referred to as a pde (page directory 
entry), and is of type struct hpde. 

The virtual space and offset are compared to information in the pde to
verify the entry. The physical address is retrieved from the pde to
complete the translation from virtual address to physical address.

When multiple addresses hash to the same htbl entry 

As with any hash algorithm, multiple addresses can map to the same
htbl index. The entry in htbl is actually the starting point for a linked
list of pdes. Each entry has a pde_next pointer that points to another
pde, or contains NULL if it is the last item of the linked list.
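
The chained lookup described above might look like the following sketch. The Pde class, the htbl_index hash, and the insert and lookup helpers are all hypothetical stand-ins: the real hash function is processor-specific, and in the real table the htbl slot is itself the first pde rather than a reference to one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pde:
    space: int                      # virtual space id (part of the tag)
    offset: int                     # virtual page offset (part of the tag)
    pfn: int                        # physical page frame number
    pde_next: Optional["Pde"] = None


def htbl_index(space: int, offset: int, nhtbl: int) -> int:
    # Placeholder hash, an assumption; nhtbl must be a power of two.
    return (space ^ offset) & (nhtbl - 1)


def insert(htbl: List[Optional[Pde]], space: int, offset: int, pfn: int) -> None:
    # Simplification: push the new pde on the front of its chain.
    i = htbl_index(space, offset, len(htbl))
    htbl[i] = Pde(space, offset, pfn, htbl[i])


def lookup(htbl: List[Optional[Pde]], space: int, offset: int) -> Optional[int]:
    pde = htbl[htbl_index(space, offset, len(htbl))]
    while pde is not None:
        if pde.space == space and pde.offset == offset:
            return pde.pfn          # tag matches: translation found
        pde = pde.pde_next          # collision: follow the chain
    return None                     # no translation exists
```

Two addresses that hash to the same slot are both found by walking the chain and comparing the space and offset tag on each pde.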

Each htbl entry can point to two other collections of pdes, ranging from
base_pdir to htbl and from pdir (which is also the end of htbl) to
max_pdir. The entirety of the htbl and surrounding pdes is referred
to collectively as the sparse PDIR. htbl is always aligned to begin at an
address that is a multiple of its size (that is, a multiple of nhtbl *
sizeof (struct hpde)).

In practice, htbl contains sufficient entries, so that the linked lists
seldom grow beyond three links. pdir_free_list points to a linked
list of sparse PDIR entries that are not being used and are available for
use. pdir_free_list_tail points to the last pde on that linked list.

Figure 1-22 How multiple addresses hash to the same htbl entry 



Mapping Physical to Virtual Addresses 

HP-UX uses a hashed page directory to translate from virtual to physical 
address. 

The pfdat table maps physical to virtual addresses. Inverse
translations from physical to virtual use the pfn_to_virt_table[], an
array that contains entries of either space and offset page (in the case of
a single translation to a page) or a list of alias structures (when the
physical page has more than one virtual address translation).

The hashed pdir stores the physical address in the translation "bucket"
(hash table) to disassociate the physical page number from that page's
pde.
Figure 1-23 Physical-to-virtual address translation



The pfn_to_virt_table may contain the space.offset (virtual
address) corresponding to a physical address or it may have a pointer to
a linked list of alias structures, each of which has a space.offset pair.


Address Aliasing 

HP-UX supports software address aliasing for text only of exec_magic
executables. (Whereas the hardware implements address aliasing on
1 MB boundaries, software address aliasing is implemented on a per-page
basis; pages are 4 KB apart.)

When a text segment is first translated, it has no alias. However, if a
process or thread attaches to the same text segment, it may require
another translation. Processes sharing text segments do not use aliases.
Only processes with private text segments that share data pages using
copy-on-write use aliases.

When multiple virtual addresses translate to the same physical address,
HP-UX uses alias structures to keep track of them. Aliases for a page
frame (pfn) are maintained via alias chains off the
pfn_to_virt_table[]. When a pfn_to_virt_table entry's space field is
invalid and the offset field is non-zero, the non-zero value points to the
beginning of a linked list of alias structures. Each alias structure
contains the space and offset of the alias, and a temporary hold field for a
pde's access rights and access ID. The pfdat_lock of the alias's pfn
protects the alias chain from being read and modified.
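
The two shapes of a pfn_to_virt_table entry can be modeled roughly as below. This is a sketch under stated assumptions: the PfnEntry class, the INVALID_SPACE sentinel, and both helper functions are hypothetical, and a separate Python list stands in for the alias chain that the real entry reaches through its offset field. Locking (pfdat_lock) is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

INVALID_SPACE = -1  # assumed sentinel for "space field is invalid"


@dataclass
class PfnEntry:
    space: int = INVALID_SPACE
    offset: int = 0
    # Stand-in for the linked list of alias structures.
    aliases: List[Tuple[int, int]] = field(default_factory=list)


def add_translation(entry: PfnEntry, space: int, offset: int) -> None:
    if entry.space == INVALID_SPACE and not entry.aliases:
        # First translation: store space.offset directly in the entry.
        entry.space, entry.offset = space, offset
        return
    if entry.space != INVALID_SPACE:
        # Second translation: move the direct mapping onto the alias chain.
        entry.aliases.append((entry.space, entry.offset))
        entry.space, entry.offset = INVALID_SPACE, 0
    entry.aliases.append((space, offset))


def virtual_addresses(entry: PfnEntry) -> List[Tuple[int, int]]:
    if entry.space != INVALID_SPACE:
        return [(entry.space, entry.offset)]
    return list(entry.aliases)
```

A page with one translation costs no alias structures; the chain is built only when a second virtual address maps to the same frame.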

To locate the pde for a particular alias space and offset, the space and
offset are hashed for the pde chain and its corresponding pdlock. Once
the pdlock is obtained, the vtopde() routine walks the pde hash chain
to find a match of the tag.
The aa_entfreelist is the head of the doubly-linked list of free alias
entries. The system gets an alias structure from aa_entfreelist, in
which it stores the information for this new virtual-to-physical
translation.

The global variable max_aapdir contains the total number of alias pdes
on the system. Once a page is allocated for use as alias pdes, it is not
returned, so the value of max_aapdir may grow over time but will never
shrink.

The number of available alias pdes is stored in aa_pdircnt. When an
alias pde is used or reserved (we reserve one if we include an htbl pde
in an alias linked list, in case we have to move it later), aa_pdircnt is
decremented. When an alias pde is returned to aa_pdirfreelist or
unreserved, aa_pdircnt is incremented.

The number of available alias structures is kept in aa_entcnt. Once a
page is allocated for use as a group of alias structures, it is not returned.
We do not keep track of the total number of alias structures on the
system, just the number of available structures.
MAINTAINING PAGE AVAILABILITY

Two computational elements maintain page availability: 

• Paging thresholds trigger the gamut of paging events. 

• The vhand and sched daemons (system processes) handle the actual 
paging and deactivation. 

vhand monitors free pages to keep their number above a threshold and
ensure sufficient memory for demand paging. vhand governs the overall
state of the paging system. sched becomes operative when the number
of pages available in memory diminishes below a certain level. vhand
and sched are described in the context of their work shortly.

NOTE The sched process is known colloquially as the swapper. 


Paging Thresholds 

Memory management uses paging thresholds that trigger various paging 
activities. The figure shows the full range of available memory and 
indicates what paging activity occurs when memory level falls below 
each paging threshold. 


Figure 1-24 Available memory in the system

[Figure labels: total memory at boot-up (physmem), kernel static memory,
freemem, and the thresholds lotsfree, gpgslim, desfree, and minfree.
Footnote to gpgslim: fluctuates between desfree and lotsfree.]
Table 1-20


The value termed freemem represents the total number of free pages in
the phead linked list, which includes all memory available in a system
after kernel initialization.

Three tunable paging thresholds are initialized by the
setmemthresholds() routine.


setmemthresholds () paging thresholds 


Paging 

threshold 

Meaning 

lotsfree 

Plenty of free memory, specified in pages. The upper
bound from which the paging daemon will begin to
steal pages.

desfree 

Amount of memory desired free, specified in pages.
This is the lower bound at which the paging daemon
begins stealing pages.

minfree 

The minimal amount of free memory tolerable,
specified in pages. If free memory drops below this
boundary, sched() recognizes the system is
desperate for memory and deactivates entire
processes whether they are runnable or not.


The gpgslim Paging Threshold 

The gpgslim paging threshold is the point at which vhand starts
paging. gpgslim adjusts dynamically according to the needs of the
system. It oscillates between an upper bound called lotsfree and a
lower bound called desfree. Both lotsfree and desfree are
calculated when the system boots up and are based on the size of system
memory.

When the system boots, gpgslim is set to 1/4 the distance between
desfree and lotsfree (desfree + (lotsfree - desfree) / 4). As
the system runs, this value fluctuates between desfree and lotsfree.
When the sum of available memory and the number of pages scheduled
for I/O (soon to be freed) falls below gpgslim, vhand begins aging
and stealing little-used pages in an attempt to increase the available
memory above this threshold.
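
The boot-time formula and the paging trigger can be expressed as a short sketch. The function names are hypothetical, all quantities are in pages, and integer division is an assumption about how the kernel rounds.

```python
def initial_gpgslim(desfree: int, lotsfree: int) -> int:
    # One quarter of the way from desfree up to lotsfree.
    return desfree + (lotsfree - desfree) // 4


def vhand_should_page(freemem: int, pages_scheduled_for_io: int,
                      gpgslim: int) -> bool:
    # Pages already scheduled for I/O count as "soon to be freed".
    return freemem + pages_scheduled_for_io < gpgslim
```

For example, with desfree at 1024 pages and lotsfree at 8192 pages, gpgslim starts at 1024 + 7168 / 4 = 2816 pages.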
Table 1-21


The system wants to keep memory at gpgslim. If the system is not
stressed, gpgslim starts rising, because it does not need to have a lot 
more pages freed. As memory becomes more scarce, the system tries to 
maintain the pool of free memory, causing gpgslim to fall. If gpgslim 
decreases to minfree, the system starts to deactivate entire processes. 

How Memory Thresholds are Tuned 

Performance testing has shown that memory usage differs for a server
versus a workstation. Workstations typically run a few large applications
whereas servers typically run many applications of varying size.
Consequently, the paging and deactivation thresholds on workstations
are a smaller fraction of memory than on the servers. In a typical
workstation environment, applications start up requiring a large
number of pages, which eventually reduce to a smaller working set of
pages. By allowing applications to claim more memory before paging or
deactivating, the working set is more likely to stay in memory.

Paging and activation algorithms take these and other differences into
account. Depending on the physical memory size of the system, the
paging thresholds are initialized to either a "small memory" or "large
memory" set of values.

Small Memory Thresholds 

For small memory systems (that is, systems with 32 MB or less of
freemem), the paging thresholds are set to a smaller fraction of total
memory to allow applications to utilize more memory before the system
begins paging and deactivating. The paging thresholds are set as
follows:


Small-memory paging thresholds

Threshold    Limit           Not to exceed
lotsfree     1/8 freemem     1 MB
desfree      1/16 freemem    240 KB
minfree      1/2 desfree     100 KB
Large Memory Thresholds

For large memory systems (that is, systems with greater than 32 MB of
freemem), the paging thresholds are set to a larger fraction of memory to
allow vhand to start paging earlier so that it can efficiently walk a
(potentially) longer active pregions list. This also helps sched()
process a potentially longer active process list by starting process
deactivation earlier. The paging thresholds are set as follows:


Large-memory paging thresholds

Threshold    Limit           Capacity if        Capacity if
                             freemem <2 GB      freemem >2 GB
lotsfree     1/16 freemem    32 MB              64 MB
desfree      1/64 freemem    4 MB               12 MB
minfree      1/4 desfree     1 MB               5 MB


These settings result in a linear increase of the paging thresholds up to a 
certain memory size, after which the thresholds remain fixed. For 
example, lotsfree increases linearly and reaches its maximum value of 
32 MB when freemem is 512 MB. For memory sizes beyond 512 MB, 
lotsfree remains fixed at 32 MB. This results in the system paging
earlier for smaller memory configurations and later for larger sizes. 

When physical memory sizes exceed 2 GB, all the paging thresholds are 
increased to a larger set of fixed values. 
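
The small-memory and large-memory rules can be combined into one sketch. It works in kilobytes for readability, whereas the kernel thresholds are specified in pages. The function name is hypothetical, and two readings are assumptions: a freemem of exactly 2 GB is treated as the lower tier, and above 2 GB the three fixed values from the table are returned directly.

```python
KB_PER_MB = 1024
KB_PER_GB = 1024 * 1024


def paging_thresholds_kb(freemem_kb: int) -> dict:
    """Return lotsfree, desfree, and minfree in KB for a given freemem."""
    if freemem_kb <= 32 * KB_PER_MB:
        # Small-memory systems.
        lotsfree = min(freemem_kb // 8, 1 * KB_PER_MB)
        desfree = min(freemem_kb // 16, 240)
        minfree = min(desfree // 2, 100)
    elif freemem_kb <= 2 * KB_PER_GB:
        # Large-memory systems up to 2 GB: linear growth, then a cap.
        lotsfree = min(freemem_kb // 16, 32 * KB_PER_MB)
        desfree = min(freemem_kb // 64, 4 * KB_PER_MB)
        minfree = min(desfree // 4, 1 * KB_PER_MB)
    else:
        # Beyond 2 GB: the larger set of fixed values (assumed reading).
        lotsfree = 64 * KB_PER_MB
        desfree = 12 * KB_PER_MB
        minfree = 5 * KB_PER_MB
    return {"lotsfree": lotsfree, "desfree": desfree, "minfree": minfree}
```

With freemem at 512 MB, lotsfree comes out at exactly 32 MB, which matches the point at which the text says the linear increase stops.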

How Paging is Triggered 

The rate at which schedpaging() runs is termed vhandrunrate, a tunable
parameter (by default, eight times per second). Paging is activated
when the sum of free memory and paroled memory (freemem +
parolemem) is less than lotsfree.

vhand, the pageout daemon 

Programmatically, vhand is awakened by schedpaging() periodically
to maintain recently referenced pages and to move pages out when
memory is tight. vhand operates on the basis of vhandargs_t, which
consists of a pointer to the target pregion, a count of the physical pages
visited, and a nice value for preferential aging.

vhand can also be awakened by allocpfd2() (in vm_page.c), a routine
that allocates a single page of memory.

If all the pages on the free memory list (phead) are locked, or the routine
has been called while using the interrupt control stack (ics) and all
pages on the free list are also in the page cache (phash), allocpfd2()
cannot get any pages.

If on the ics without any available pages, allocpfd2() wakes the page
daemon. Regardless of which stack the system is running on,
allocpfd2() then wakes up unhashdaemon, which removes pages
from the page cache.

If on the ics, allocpfd2() returns NULL; if not on the ics,
allocpfd2() sleeps waiting for a page to become available, and then
retries.
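
The order of events when allocpfd2() finds no usable page can be summarized in a small sketch. It only lists the actions as strings; the function name and the action wording are hypothetical, and no real waking or sleeping is modeled.

```python
from typing import List


def allocpfd2_no_page_actions(on_ics: bool) -> List[str]:
    """List, in order, what the text says happens when no page is available."""
    actions = []
    if on_ics:
        # On the interrupt control stack the page daemon is woken first.
        actions.append("wake page daemon (vhand)")
    # unhashdaemon is woken regardless of which stack is in use.
    actions.append("wake unhashdaemon")
    if on_ics:
        # Interrupt context cannot sleep, so the caller gets NULL.
        actions.append("return NULL")
    else:
        actions.append("sleep until a page is available, then retry")
    return actions
```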

Two-Handed Clock Algorithm 

A doubly linked list of pregions, termed the active pregion list, is used 
by vhand to examine memory availability. Conceptually, the pregions 
can be visualized as being linked in a circle, in the center of which are 
two clock-like hands. The two hands function as a steal hand and an age 
hand. 

• A steal hand removes pages whose reference bits remain clear since 
the most recent pass of the age hand. 

• An age hand clears reference bits on in-core pages in an active
pregion.

The kernel automatically keeps an appropriate distance between the 
hands, based on the available paging bandwidth, the number of pages 
that need to be stolen, the number of pages already scheduled to be freed, 
and the frequency by which vhand runs. 
Figure 1-25 Two-handed vhand clock algorithm, showing also the factors that
affect vhand



The two hands cycle through the active pregion linked list of physical
memory to look for memory pages that have not been referenced recently
and move them to secondary storage - the swap space. Pages that have
not been referenced from the time the age hand passes to the time the
steal hand passes are pushed out of memory. The hands rotate at a
variable rate determined by the demand for memory.

The vhand daemon decides when to start paging by determining how
much free memory is available. Once free memory drops below the
gpgslim threshold, paging occurs. vhand attempts to free enough pages
to bring the supply of memory back up to gpgslim. Between gpgslim
and lotsfree, the page daemon continues to age pages (that is, clear
their reference bits) but no longer steals pages.
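
The interplay of the two hands can be illustrated with a toy model. It is a sketch only: it treats memory as one flat ring of pages rather than a list of pregions, advances each hand one page per step, and uses hypothetical names (Page, clock_step). The stealing flag models the range between gpgslim and lotsfree, where pages are aged but not stolen.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    referenced: bool = True   # reference bit, set again on any access
    resident: bool = True     # False once the page has been stolen


def clock_step(pages: List[Page], age_hand: int, steal_hand: int,
               stealing: bool = True) -> Tuple[int, int, Optional[int]]:
    """Advance both hands by one page; return (age, steal, stolen index)."""
    stolen = None
    victim = pages[steal_hand]
    # Steal hand: take the page only if nobody touched it since it was aged.
    if stealing and victim.resident and not victim.referenced:
        victim.resident = False
        stolen = steal_hand
    # Age hand: clear the reference bit so later use can be detected.
    pages[age_hand].referenced = False
    n = len(pages)
    return (age_hand + 1) % n, (steal_hand + 1) % n, stolen
```

In a ring of eight pages with the age hand four pages ahead, a page aged in step 1 is reached by the steal hand in step 5; if a process touched it in between, its reference bit is set again and the page survives.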

Factors Affecting vhand

vhand responds to various workloads, transient situations, and memory 
configurations. When aging and stealing from regions, vhand 

• ages some constant fraction of each pregion. 
• uses the pregion field p_agescan to track the last age hand
location.

• uses the pregion field p_ageremain to track remaining pages to be
aged.

• uses the pregion field p_stealscan to track the last steal hand
location.

• pushes vfd/dbd pairs to swap if they have no valid pages. 

When the age hand arrives at a region, it ages some constant fraction of
pages before moving to the next region (by default 1/16 of the region's
total pages). The p_agescan tag enables the age hand to move to the
location within a pregion where it left off during its previous pass,
while p_ageremain charts how many pages must be aged to fill the
1/16 quota before moving on to the next pregion.

The steal hand uses the pregion field p_stealscan to locate itself
within a pregion and resume taking pages that have not been
referenced since last aged. If no valid pages remain, vhand pushes out of
memory the vfd/dbd pairs associated with the region.

How much to age and steal depends on several factors: 

• frequency of vhand runs (by default eight times per second). 

• available paging bandwidth (based on comparison with a global rate 
of pageouts completed within an interval of time). 

• how often the system falls to zero free memory.

• position of the paging threshold gpgslim. 

• number of pages already scheduled to be freed. 

vhand is biased against threads that have nice priorities: the nicer a 
thread, the more likely vhand will steal its pages. The pregion field 
p_bestnice reflects the best (numerically, the smallest value) nice 
value of all threads sharing a region. 

What Happens when vhand Wakes Up 

Refer to the table that follows for explanations of the vhand variables. 

• vhand establishes pagecounts for pages to age and pages to steal, and 
sets the coalescecnt to zero. 
• vhand uses the scritical flag to get access to the system critical
memory pool. (The scritical flag for the vhand process is set when
the process starts running for the first time.)

• vhand increments the value coalescecnt and compares it to the value 
coalescerate. If coalescecnt is higher, vhand attempts to 
remove pages from kernel allocation buckets until freemem is above 
lotsfree. Then vhand resets coalescecnt to zero. 

• Next vhand updates the value of gpgslim, based on the value of
memzeroperiod.

• vhand updates pageoutrate, using pageoutcnt.

• vhand updates targetlaps, the number of desired laps between the
age and steal hands. If fewer CPU cycles are being used than the
value of targetcpu, vhand increases the value of targetlaps (up
to a maximum of 15); if more CPU cycles are being used than
targetcpu, targetlaps is decreased.

• vhand updates agerate, the number of pages to age per second. 

• If vhandinfoticks is non-zero, diagnostic information prints to the 
console. 


NOTE None of the variables in the table that follows may be tuned.
Table 1-23


Variables affecting vhand 


Variable 

Purpose 

coalescerate 

How often vhand attempts to reclaim
unused memory from the kernel allocation
buckets, beginning at 128; that is, every 128th
time vhand runs, it attempts to return memory
to the system.

• If successful, vhand resets coalescerate 
to every 128th time. 

• If unsuccessful, vhand multiplies 
coalescerate by two (checks memory half 
as often) up to every 512th time. 

memzeroperiod 

Minimum time period (defaults seconds) 
permissible for freemem to reach zero events; 
determines how often gpgslim is adjusted 
when vhand () is running. 

• gpgslim is incremented if freemem does 
not reach zero twice within 

memzeroperiod. 

• gpgslim is decremented if freemem 
reaches zero twice within memzeroperiod 
slightly above lotsfree. 

pageoutrate 

Current pageout rate, calculated empirically 
from number of pageouts completed. 

pageoutcnt 

Recent count of pageouts completed 

targetlaps 

Ideal gap between steal and age hands for
handlaps; adapts at run time. During normal
operation, the hands should be as far apart as
possible to give processes maximum time to
reset a cleared reference bit being used by a
page. targetlaps is defined in the kernel as a
static variable; it does not appear in the symbol
table.

targetcpu 

Maximum percentage of CPU vhand should
spend paging (default value is 10%).

handlaps 

Actual number of laps between the age and 
steal hands. 

agerate 

Number of pages the age hand visits to age per
second; adapts continually to system load. 

These are defined in the kernel as static 
variables (meaning they do not appear in the 
symbol table). 

stealrate 

How many pages the steal hand visits per
second; adapts continually to system load. 

These are defined in the kernel as static 
variables (meaning they do not appear in the 
symbol table). 
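
Two of the adaptive rules in this table are concrete enough to sketch: the coalescerate backoff and the targetlaps adjustment. The function names are hypothetical. For targetlaps, the step size of one lap and the floor of one lap are assumptions; the text gives only the direction of change and the maximum of 15.

```python
COALESCE_MIN = 128     # starting interval from the table
COALESCE_MAX = 512     # largest interval from the table
TARGETLAPS_MAX = 15    # maximum stated in the text


def next_coalescerate(current: int, reclaimed: bool) -> int:
    """Reset the interval on success; double it (up to 512) on failure."""
    if reclaimed:
        return COALESCE_MIN
    return min(current * 2, COALESCE_MAX)


def next_targetlaps(current: int, cpu_used_pct: float,
                    targetcpu_pct: float = 10.0) -> int:
    """Widen the gap when vhand is under its CPU budget, narrow it when over."""
    if cpu_used_pct < targetcpu_pct:
        return min(current + 1, TARGETLAPS_MAX)   # step of 1 is assumed
    if cpu_used_pct > targetcpu_pct:
        return max(current - 1, 1)                # floor of 1 is assumed
    return current
```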


vhand Steals and Ages Pages 

Once vhand establishes its criteria, it proceeds to traverse the linked 
list of pregions. Continuing in the clock-hands analogy, vhand is ready to 
move its hands. 

• vhand determines how many pages and what pages are available to 
steal. 

• Next, vhand moves the age hand to clear the reference bit from a 
selected number of pages. 

• If the steal hand is pointing to bufcache_preg, vhand steals
buffers from the buffer cache with the stealbuffers() routine.
The global parameter dbc_steal_factor determines how much more
aggressively to steal buffer cache pages than pregion pages. If
dbc_steal_factor has a value of 16, buffer cache pages are
treated no differently than pregion pages; the default value of 48
means that buffer cache pages are stolen three times as
aggressively as pregion pages.
• If the steal hand points to a pregion whose region has no valid
pages (that is, r_nvalid == 0), vhand pushes its B-tree out to
the swap device. If none of the processes using the region are
loaded in memory (that is, r_incore == 0), the entire region
may be swapped out.

• Otherwise, vhand steals all pages between p_stealhand and
(p_agescan - p_count/16 * handlaps), up to the steal quota
(calculated from stealrate).

• vhand updates p_stealscan to the page number following the
last stolen page of the affected pregion.

• If vhand has not stolen as many pages as permissible (calculated 
from stealrate), it moves to the next pregion and repeats the 
process until it satisfies the system's demand. 

• If the age hand points to bufcache_preg, vhand ages one
sixteenth of the pages in the buffer cache with the agebuffers()
routine.

• vhand determines the best nice value (that is, the lowest number)
of all the pregions using the region. For each page in the region,
if the nice value exceeds a randomly generated number, vhand
does not age the page.

• Otherwise, vhand ages all pages between p_agehand and
(p_agehand + p_ageremain) by clearing the pde_ref bit and
purging the TLB.

• Finally, vhand updates p_agehand to be the page number after 
the last aged page in the affected pregion. 

Note, the steal hand is moved first to keep it behind the age hand and 
prevent aging and stealing a page in the same cycle. 
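
The steal range and the buffer cache bias described in the steps above reduce to simple arithmetic, sketched here. The function names are hypothetical, the range is treated as half-open, and wraparound within the pregion is ignored; these are simplifying assumptions.

```python
from typing import Tuple


def steal_window(p_stealhand: int, p_agescan: int, p_count: int,
                 handlaps: int, steal_quota: int) -> Tuple[int, int]:
    """Return the [start, end) page range the steal hand may take."""
    # Stay handlaps sixteenths of the pregion behind the age hand.
    limit = p_agescan - (p_count // 16) * handlaps
    if limit <= p_stealhand:
        return (p_stealhand, p_stealhand)      # nothing to steal yet
    end = min(limit, p_stealhand + steal_quota)
    return (p_stealhand, end)


def buffer_cache_aggressiveness(dbc_steal_factor: int) -> float:
    # 16 means parity with pregion pages; 48 means three times as aggressive.
    return dbc_steal_factor / 16
```

For a pregion of 1600 pages with handlaps of 2, the steal hand stays at least 200 pages behind p_agescan.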


Figure 1-26 
The sched() routine

The sched() routine (colloquially termed "the swapper") handles the
deactivation and reactivation of processes when free memory falls below
minfree, or when the system appears to be thrashing.


NOTE Deactivation occurs on a per-thread basis. sched() chooses to
deactivate at the process level and then deactivates each thread.


Deactivation occurs when sched () determines the system: 

• is low on memory; that is, if freemem falls below the deactivation 
threshold minfree and more than one process is running. 

• appears to be thrashing; that is, if the system has a high paging rate 
and low CPU usage. 

Reactivation occurs when the system is no longer low on memory or
thrashing.
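
The two deactivation conditions can be written as a single predicate. This is a sketch: the function name is hypothetical, and high_paging_rate and low_cpu_pct are hypothetical cutoffs, since the text does not say what counts as a "high" paging rate or "low" CPU usage.

```python
def should_deactivate(freemem: int, minfree: int, running_processes: int,
                      paging_rate: float, cpu_usage_pct: float,
                      high_paging_rate: float, low_cpu_pct: float) -> bool:
    """True when the system is low on memory or appears to be thrashing."""
    low_memory = freemem < minfree and running_processes > 1
    thrashing = paging_rate > high_paging_rate and cpu_usage_pct < low_cpu_pct
    return low_memory or thrashing
```

Reactivation is the complementary case: it becomes possible once this predicate is false.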

What to Deactivate or Reactivate 

Deactivation and reactivation are determined by: 

• process priority; the lower the process priority (meaning the higher
the nice value), the more likely it will be deactivated. The higher the
process priority, the more likely it will be reactivated. Real-time
processes are ineligible for deactivation.

• process size; the larger the process resident set size, the more likely it
will be deactivated.

• process state; a process that has been sleeping or has been in memory
for some time is likely to be deactivated. A process that has been
deactivated for a while and is now ready to run is likely to be reactivated.

• process type. A batch process (one that works continuously) or one
marked for serialization is more likely than an interactive process
(one that works in spurts) to be deactivated. Interactive processes are
more likely to be reactivated than batch or serialized processes.

• time in current state 
The swapper deactivates processes and prevents them from running,
thus reducing the rate at which new pages are accessed. Once the swapper
detects that available memory has risen above minfree and the system
is not thrashing, the swapper reactivates the deactivated processes and
continues monitoring memory availability.

Figure 1-27 sched() chooses processes to deactivate based on size, nice
priority, and how long it has been running.



sched () walks the chain of active processes, examining each, and 
deciding the best candidate to be deactivated based on size, nice priority, 
and how long it has been running. 

Programmatically, sched() deactivates and reactivates processes.

If the system appears to be thrashing or experiencing memory pressure,
the sched routine walks through the active process list calculating each
process's deactivation priority based on type, state, size, length of time in
memory, and how long it has been sleeping. (Batch processes and processes
marked for serialization by the serialize() command are more likely
to be deactivated than interactive processes.) The best candidate is then
marked for deactivation.

If the system is not thrashing or experiencing memory pressure, the
sched routine walks through the active process list calculating each
deactivated process's reactivation priority based on how long it has been
deactivated, its size, state, and type. Batch processes and those marked
by the serialize() command are less likely to be reactivated than
interactive processes. Once the most deserving process has been
determined, it is reactivated.

When a process is deactivated 

Once a process and its pregions are marked for deactivation, sched () 

• removes the process from the run queue. 

• adds its uarea to the active pregion list so that vhand can page it
out.

• moves all the pregions associated with the target process in front of
the steal hand, so that vhand can steal from them immediately.

• enables vhand to scan and steal pages from the entire pregion, 
instead of 1/16. 

Eventually, vhand pushes the deactivated process’s pages to secondary 
storage. 

When a process is reactivated 

Processes stay deactivated until the system has freed up enough memory 
and the paging rate has slowed sufficiently to return processes to the run 
queue. The process with the highest reactivation priority is then 
returned to the run queue. 

Once a process and its pregions are marked for reactivation, sched () : 

• removes the process’s uarea from the active pregion list. 

• clears all deactivation flags. 

• brings in the vfd/dbd pairs. 

• faults in the uarea. 

• adds the process to the run queue. 

Self-Deactivation 

Earlier HP-UX implementations did not permit a process to be swapped 
out if it was holding a lock, doing I/O, or was not at a signalable priority. 
Even if priority made it most likely to be deactivated, vhand bypassed 
the process. 
Now, if the most deserving process cannot be deactivated immediately, it
is marked for self-deactivation; that is, the process sets a
self-deactivation flag. The next time the process must fault in a page, it
deactivates itself.


NOTE sched() deactivates and reactivates processes. As part of a process's
deactivation or reactivation, all its threads get deactivated or
reactivated. sched() does not deactivate or reactivate threads
individually.


Thrashing 

Thrashing is defined as low CPU usage with high paging rate. 
Thrashing might occur when several processes are running, several 
processes are waiting for I/O to complete, or active processes have been 
marked for serialization. 

On systems with very demanding memory needs (for example, systems 
that run many large processes), the paging daemons can become so busy 
deactivating and reactivating processes and swapping pages in and out 
that the system spends too much time paging and not enough time running 
processes. 

When this happens, system performance degrades rapidly, sometimes to 
such a degree that nothing seems to be happening. At this point, the 
system is said to be thrashing, because it is doing more overhead than 
productive work. 

If your working set is larger than physical memory, the system will 
thrash. To solve the problem, 

• reduce the working set of running processes by deactivation, or 

• increase the size of physical memory. 

If you are left with one huge process constrained by physical memory 
and the system still thrashes, you will need to rewrite the application so 
that it uses fewer pages simultaneously, for example by grouping data 
structures according to access pattern. 
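
The effect of a working set that does not fit in physical memory can be 
illustrated with a toy model. The sketch below is not HP-UX code: it 
assumes a plain least-recently-used replacement policy rather than the 
two-handed clock used by vhand, and the name count_faults and the frame 
counts are illustrative only. 

```python
from collections import OrderedDict

def count_faults(reference_string, frames):
    """Count page faults under LRU replacement with a fixed number of frames."""
    resident = OrderedDict()  # resident pages, least recently used first
    faults = 0
    for page in reference_string:
        if page in resident:
            resident.move_to_end(page)        # hit: now the most recently used
        else:
            faults += 1                       # miss: the page must be brought in
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = None
    return faults

# A process that loops 100 times over a working set of 5 pages.
refs = [page for _ in range(100) for page in range(5)]

print(count_faults(refs, frames=5))  # working set fits in memory
print(count_faults(refs, frames=4))  # working set is one frame too large
```

With five frames only the first touch of each page faults; with four 
frames every reference in this cyclic pattern faults, which is the 
behavior the text describes as thrashing. 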

Serialization 

All processes marked by the serialize command are run serially. This 
functionality unjams the bottleneck (recognizable by process throughput 
degradation) caused by groups of large processes contending for the CPU. 
By running large processes one at a time, the system can make more 
efficient use of the CPU as well as system memory, since each process 
does not end up constantly faulting in its working set, only to have the 
pages stolen when another process starts running. 

As long as there is enough memory in the system, processes marked by 
serialize() behave no differently than other processes in the system. 
However, once memory becomes tight, processes marked by serialize are 
run one at a time in priority order. Each process runs for a finite interval 
of time before another serialized process may run. The user cannot 
enforce an execution order on serialized processes. 

serialize() can be run from the command line or with a pid value. 
serialize() also has a timeshare option that returns the pid 
specified to normal timeshare scheduling algorithms. 

If serialization is insufficient to eliminate thrashing, you will need to add 
more main memory to the system. 

SWAP SPACE MANAGEMENT 

Swap space is an area on a high-speed storage device (almost always a 
disk drive), reserved for use by the virtual memory system for 
deactivation and paging processes. At least one swap device (primary 
swap) must be present on the system. 

During system startup, the location (disk block number) and size of each 
swap device is displayed in 512-byte blocks. The swapper reserves swap 
space at process creation time, but does not allocate swap space from the 
disk until pages need to go out to disk. Reserving swap at process 
creation protects the swapper from running out of swap space. You can 
add or remove swap as needed (that is, dynamically) while the system is 
running, without having to regenerate the kernel. 

HP-UX uses both physical and pseudo swap to enable efficient execution 
of programs. 

Pseudo-Swap Space 

System memory used for swap space is called pseudo-swap space. It 
allows users to execute processes in memory without allocating physical 
swap. Pseudo-swap is controlled by an operating-system parameter; by 
default, swapmem_on is set to 1, enabling pseudo-swap. 

Typically, when the system executes a process, swap space is reserved for 
the entire process, in case it must be paged out. According to this model, 
to run one gigabyte of processes, the system would have to have one 
gigabyte of configured swap space. Although this protects the system 
from running out of swap space, disk space reserved for swap is 
under-utilized if minimal or no swapping occurs. 

To avoid such waste of resources, HP-UX is configured to access up to 
three-quarters of system memory capacity as pseudo-swap. This means 
that system memory serves two functions: as process-execution space 
and as swap space. By using pseudo-swap space, a one-gigabyte memory 
system with one gigabyte of swap can run up to 1.75 GB of processes. As 
before, if a process attempts to grow or be created beyond this extended 
threshold, it will fail. 
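
The arithmetic behind the 1.75 GB figure can be written out as a small 
calculation. This is only a sketch of the rule stated above 
(three-quarters of memory counted as pseudo-swap); the function name 
max_process_space_mb is hypothetical and sizes are in megabytes. 

```python
def max_process_space_mb(physical_swap_mb, memory_mb, swapmem_on=True):
    """Upper bound on reservable process space: physical swap plus pseudo-swap."""
    # With swapmem_on set, three-quarters of memory counts as pseudo-swap.
    pseudo_swap_mb = memory_mb * 3 // 4 if swapmem_on else 0
    return physical_swap_mb + pseudo_swap_mb

# The example from the text: 1 GB of memory and 1 GB of configured swap.
print(max_process_space_mb(1024, 1024))                    # 1792 MB = 1.75 GB
print(max_process_space_mb(1024, 1024, swapmem_on=False))  # 1024 MB
```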

When pseudo-swap is used for swap, the pages are locked; as the amount 
of pseudo-swap increases, the amount of lockable memory decreases. 

For factory-floor systems (such as controllers), which perform best when 
the entire application is resident in memory, pseudo-swap space can be 
used to enhance performance: you can either lock the application in 
memory or make sure the total size of the processes created does not 
exceed three-quarters of system memory. 

Pseudo-swap space is set to a maximum of three-quarters of system 
memory because the system can begin paging once three-quarters of 
available system memory has been used. The unused quarter of memory 
provides a buffer between the system and the swapper to give the system 
computational flexibility. 

When the number of processes created approaches capacity, the system 
might exhibit thrashing and a decrease in system response time. If 
necessary, you can disable pseudo-swap space by setting the tunable 
parameter swapmem_on in /usr/conf/master.d/core-hpux to zero. 

Regions that have pseudo-swap allocated are kept on a null-terminated, 
doubly linked list whose head is pswaplist. 

Physical Swap Space 

There are two kinds of physical swap space: device swap and file-system 
swap. 

Device Swap Space 

Device swap space resides in its own reserved area (an entire disk or 
logical volume of an LVM disk) and is faster than file-system swap 
because the system can write an entire request (256 KB) to a device at 
once. 

File-System Swap Space 

File-system swap space is located on a mounted file system and can vary 
in size with the system's swapping activity. However, its throughput is 
slower than device swap, because free file-system blocks may not always 
be contiguous; therefore, separate read/write requests must be made for 
each file-system block. 

To optimize system performance, file-system swap space is allocated and 
de-allocated in swchunk-sized chunks. swchunk is a configurable 
operating system parameter; its default is 2048 KB (2 MB). Once a 
chunk of file system space is no longer being used by the paging system, 
it is released for file-system use, unless it has been preallocated with 
swapon. 

If swapping to file-system swap space, each chunk of swap space is a file 
in the file system swap directory, and has a name constructed from the 
system name and the swaptab index (such as becky.6 for swaptab[6] 
on a system named becky). 

Swap Space Parameters 

Several configurable parameters deal with swapping. 

Table 1-24 Configurable swap-space parameters 

Parameter         Purpose 

swchunk           The number of dev_bsize blocks in a unit of swap 
                  space; by default, 2 MB on all systems. 

maxswapchunks     Maximum number of swap chunks allowed on a system. 

swapmem_on        Parameter allowing creation of more processes than 
                  you have physical swap space for, by using 
                  pseudo-swap. 

Swap Space Global Variables 

When the kernel is initialized, conf.c includes globals.h, which contains 
numerous characteristics related to swap space, shown in the next table. 
The most important to swap space reservation are swapspc_cnt, 
swapspc_max, swapmem_cnt, swapmem_max, and sys_mem. 

Table 1-25 Swap-space characteristics in globals.h 


Element           Meaning 

bswlist           Head of free swap header list. 

*pageoutbp        Pointer to swbuf header used by pageout when 
                  swapping. 

ref_hand          Current reference hand used by pageout daemon. 

maxmem            Page count of actual max memory per process. 

physmem           Page count of physical memory on this CPU. 

nswdev            Number of swap devices. 

nswap             Page count of size of swap space. 

*fswdevt          Pointer to file system swap table. 

*swaptab          Pointer to the table of swap chunks. 

swapphys_buf      Pages of physical swap space to keep available. 

swapphys_cnt      Pages of available physical swap space on disk. 

swapspc_cnt       Total amount of swap currently available on all 
                  devices and file systems enabled, in units of pages. 
                  Updated each time a device or file system is enabled 
                  for swapping. 

swapspc_max       Total amount of device and file-system swap currently 
                  enabled on the system, in units of pages. Updated 
                  each time a device or file system is enabled for 
                  swapping. 

swapspc_debit     Number of swap blocks by which to adjust swapspc_cnt. 

swapspc_sparing   Number of swap blocks unavailable to swap. 

swapmem_max       Maximum number of pages of pseudo-swap enabled. 
                  Initialized to 3/4 of available system memory. 

swapmem_cnt       Total number of pages of pseudo-swap currently 
                  available. Initialized to 3/4 of available system 
                  memory. 

maxfs_pri         Highest available device priority. 

maxdev_pri        Highest available swap priority. 

sys_mem           Number of pages of memory not available for use as 
                  pseudo-swap. Initialized to 1/4 of available system 
                  memory. 

sysmem_max        Maximum pages not available for swap. 

freemem           Page count of remaining blocks of free memory. 

freemem_cnt       Number of processes waiting for memory. 


Swap Space Values 

System swap space values are calculated as follows: 

• Total swap available on the system is swapspc_max (for device swap 
and file-system swap) + swapmem_max (for pseudo-swap). 

• Allocated swap is swapspc_max - [sum(swdevt[n].sw_nfpgs) 
+ sum(fswdevt[n].fsw_nfpgs)] (for device swap and file-system 
swap) + (swapmem_max - swapmem_cnt) (for pseudo-swap). 
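
The two formulas above can be checked with a short calculation. The 
sketch below takes the kernel counters as plain integers (in pages) and 
the per-device and per-file-system free-page counts as lists; the 
function names and the sample values are illustrative, not taken from a 
real system. 

```python
def total_swap(swapspc_max, swapmem_max):
    """Total swap: device and file-system swap plus pseudo-swap, in pages."""
    return swapspc_max + swapmem_max

def allocated_swap(swapspc_max, sw_nfpgs, fsw_nfpgs, swapmem_max, swapmem_cnt):
    """Allocated swap in pages, following the formula in the text."""
    # Physical part: enabled swap minus the free pages on every device
    # (sw_nfpgs) and every file-system swap area (fsw_nfpgs).
    physical = swapspc_max - (sum(sw_nfpgs) + sum(fsw_nfpgs))
    # Pseudo-swap part: enabled pseudo-swap minus what is still available.
    pseudo = swapmem_max - swapmem_cnt
    return physical + pseudo

# Hypothetical system: 262144 pages of physical swap, 196608 of pseudo-swap.
print(total_swap(262144, 196608))
print(allocated_swap(262144, [100000, 50000], [12144], 196608, 196000))
```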

In HP-UX, only data area growth (using sbrk()) or stack growth will 
cause a process to die for lack of swap space. Program text does not use 
swap. 

Reservation of Physical Swap Space 

Swap reservation is a numbers game. The system has a finite number of 
pages of physical swap space. By decrementing the appropriate 
counters, HP-UX reserves space for its processes. 

Most UNIX systems and UNIX-like systems allocate swap when needed. 
However, if the system runs out of swap space but needs to write a 
process's page(s) to a swap device, it has no alternative but to kill the 
process. To alleviate this problem, HP-UX reserves swap at the time the 
process is forked or exec'd. When a new process is forked or exec'd, if 
insufficient swap space is available to reserve for the entire 
process, the process may not execute. 

At system startup, swapspc_cnt and swapmem_cnt are initialized to 
the total amount of swap space and pseudo-swap available. 

Whenever the swapon() call is made to a device or file system, the amount 
of swap newly enabled is converted to units of pages and added to the 
two global swap-reservation counters swapspc_max (total enabled swap) 
and swapspc_cnt (available swap space). 

Each time swap space is reserved for a process (that is, at process 
creation or growth time), swapspc_cnt is decremented by the number of 
pages required. The kernel does not actually assign disk blocks until 
needed. 

Once swap space is exhausted (that is, swapspc_cnt == 0), any 
subsequent request to reserve swap causes the system to allocate an 
additional chunk of file-system swap space. If successful, both 
swapspc_max and swapspc_cnt are updated and the current request (and 
subsequent requests) can be satisfied. If a file-system chunk cannot be 
allocated, the request fails, unless pseudo-swap is available. 

When swap space is no longer needed (due to process termination or 
shrinkage), swapspc_cnt is incremented by the number of pages freed. 
swapspc_cnt never exceeds swapspc_max and is always greater than 
or equal to zero. If a chunk of file-system swap is no longer needed, it is 
released back to the file system and swapspc_max and swapspc_cnt 
are updated. 

If no device or file-system swap space is available, the system uses 
pseudo-swap as a last resort. It decrements swapmem_cnt and locks the 
pages into memory. Pseudo-swap is either free or allocated; it is never 
reserved. 
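
The counter bookkeeping described in this section can be modeled in a 
few lines. This is a simplified sketch, not the kernel's code: the class 
SwapReservation, the fs_chunks_left field, and the rule that a request 
is satisfied entirely from one kind of swap are assumptions made for 
illustration. It also grows file-system swap whenever physical swap is 
short, rather than only when swapspc_cnt reaches exactly zero. 

```python
class SwapReservation:
    """Toy model of the swap reservation counters (all values in pages)."""

    def __init__(self, device_pages, memory_pages, fs_chunks_left=0, chunk_pages=512):
        self.swapspc_max = device_pages           # physical swap enabled
        self.swapspc_cnt = device_pages           # physical swap still available
        self.swapmem_max = memory_pages * 3 // 4  # pseudo-swap enabled
        self.swapmem_cnt = self.swapmem_max       # pseudo-swap still available
        self.fs_chunks_left = fs_chunks_left      # hypothetical: chunks the file system can supply
        self.chunk_pages = chunk_pages            # 2 MB chunk of 4 KB pages

    def reserve(self, pages):
        """Reserve swap; return 'physical', 'pseudo', or None on failure."""
        # Try to grow file-system swap one chunk at a time when short.
        while self.swapspc_cnt < pages and self.fs_chunks_left > 0:
            self.fs_chunks_left -= 1
            self.swapspc_max += self.chunk_pages
            self.swapspc_cnt += self.chunk_pages
        if self.swapspc_cnt >= pages:
            self.swapspc_cnt -= pages
            return "physical"
        if self.swapmem_cnt >= pages:             # last resort: pseudo-swap
            self.swapmem_cnt -= pages
            return "pseudo"
        return None

    def release_physical(self, pages):
        """Return reserved pages; swapspc_cnt never exceeds swapspc_max."""
        self.swapspc_cnt = min(self.swapspc_cnt + pages, self.swapspc_max)

swap = SwapReservation(device_pages=1000, memory_pages=400, fs_chunks_left=1)
for request in (900, 500, 200, 200):
    print(request, swap.reserve(request), swap.swapspc_cnt, swap.swapmem_cnt)
```

In this run the second request succeeds only because a file-system chunk 
is added, the third falls back to pseudo-swap, and the fourth fails. 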

Swap Reservation Spinlock 

The rswap_lock spinlock guards the swap reservation structures 
swapspc_cnt, swapspc_max, swapmem_cnt, swapmem_max, sys_mem, 
and pswaplist. 

Reservation of Pseudo-Swap Space 

Approximately 3/4 of available system memory is available as 
pseudo-swap space if the tunable parameter swapmem_on is set to 1. 
Pseudo-swap is tracked in the global pseudo swap reservation counters 
swapmem_max (enabled pseudo-swap) and swapmem_cnt (currently 
available pseudo-swap). If physical swap space is exhausted and no 
additional file-system swap can be acquired, pseudo swap space is 
reserved for the process by decrementing swapmem_cnt. 

For example, on a 64 MB system, swapmem_max and swapmem_cnt track 
approximately 48 MB of pseudo-swap space, the remainder tracked by 
the global sys_mem, which represents the number of pages reserved for 
system use only. 

Processes track the number of pseudo swap pages allocated to them by 
incrementing a per region counter r_swapmem. All regions using pseudo 
swap are linked on the pseudo swap list pswaplist. Once pseudo swap 
is exhausted (that is, swapmem_cnt==0), attempts at process creation or 
growth will fail. 

Because the swapper competes with the operating system for use of 
memory, swapmem_cnt can also be decremented by the operating 
system for any dynamically allocated memory. Once swapmem_cnt is 
exhausted, subsequent requests for swap space fail; however, the 
operating system can still reserve memory out of the malloc pool. 

Once a process no longer needs its allocated pseudo swap space, 
swapmem_cnt is incremented by the amount released and r_swapmem is 
updated. If the system returns the pseudo swap space used for 
dynamically allocated kernel memory, the amount being released is first 
added to sys_mem. Once sys_mem grows to its maximum value, any 
additional pages returned are used to update swapmem_cnt. 

swapmem_cnt must be less than or equal to swapmem_max and greater 
than or equal to zero. 
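
The order in which returned kernel memory is credited can be sketched as 
follows. The function name release_kernel_pages is hypothetical, and the 
sketch assumes that the "maximum value" of sys_mem is the sysmem_max 
global listed in the table of swap-space characteristics. 

```python
def release_kernel_pages(pages, sys_mem, sysmem_max, swapmem_cnt, swapmem_max):
    """Credit pages returned by the kernel: sys_mem first, then swapmem_cnt."""
    to_sys_mem = min(pages, sysmem_max - sys_mem)  # refill sys_mem up to its maximum
    sys_mem += to_sys_mem
    # Whatever is left over becomes available pseudo-swap again, but
    # swapmem_cnt may never exceed swapmem_max.
    swapmem_cnt = min(swapmem_cnt + (pages - to_sys_mem), swapmem_max)
    return sys_mem, swapmem_cnt

# 100 pages come back while sys_mem is 50 pages below its maximum.
print(release_kernel_pages(100, sys_mem=950, sysmem_max=1000,
                           swapmem_cnt=2000, swapmem_max=3000))
```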

Because pseudo swap is shared by the swapper and memory allocation 
routines, it is used sparingly. The operating system periodically checks 
to see if physical swap space has been recently freed. If it has, the 
system attempts to migrate processes that are using pseudo swap to the 
available physical swap by walking the doubly linked list of pseudo swap 
regions. swapspc_cnt is decremented by the r_swapmem value for each 
region on the list until either swapspc_cnt drops to zero or no other 
regions utilize pseudo swap. swapmem_cnt is then incremented by the 
amount of pseudo swap successfully migrated. 
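
A minimal sketch of this migration walk is shown below. It represents 
pswaplist as a Python list of dictionaries, and it assumes that a region 
is migrated only as a whole and that the walk stops at the first region 
that no longer fits; the text does not say how a partial fit is handled, 
so that rule is an assumption. 

```python
def migrate_pseudo_to_physical(pswaplist, swapspc_cnt, swapmem_cnt):
    """Move pseudo-swap reservations onto free physical swap (units: pages)."""
    migrated = 0
    for region in pswaplist:
        needed = region["r_swapmem"]
        if needed == 0:
            continue                  # region holds no pseudo-swap
        if needed > swapspc_cnt:
            break                     # assumption: stop when a region does not fit
        swapspc_cnt -= needed         # reserve physical swap instead
        region["r_swapmem"] = 0
        migrated += needed
    # Pseudo-swap that was given up becomes available again.
    return swapspc_cnt, swapmem_cnt + migrated

regions = [{"r_swapmem": 300}, {"r_swapmem": 200}, {"r_swapmem": 400}]
print(migrate_pseudo_to_physical(regions, swapspc_cnt=600, swapmem_cnt=100))
print([r["r_swapmem"] for r in regions])
```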

Pseudo Swap and Lockable Memory 

Because pseudo swap is related to system memory usage, the swap 
reservation scheme reflects lockable memory policies. 


Although the system is not necessarily allocating additional memory 
when a process locks itself into memory, locked pages are no longer 
available for general use. This causes swapmem_cnt to be decremented 
to account for the pages. swapmem_cnt is also decremented by the size 
of the entire process if that process is locked in memory with plock(). 

Figure 1-28 Reserving swap space from file-system swap to memory 

(Figure shows Memory and File System Swap.) 

How Swap Space is Prioritized 

All swap devices and file systems enabled for swap have an associated 
priority, ranging from 0 to 10, indicating the order that swap space from 
a device or file system is used. System administrators can specify 
swap-space priority using a parameter of the swapon(1M) command. 

Swapping rotates among both devices and file systems of equal priority. 
Given equal priority, however, devices are swapped to by the operating 
system before file systems, because devices make more efficient use of 
CPU time. 

We recommend that you assign the same swapping priority to most swap 
devices, unless a device is significantly slower than the rest. Assigning 
equal priorities limits disk head movement, which improves swapping 
performance. 

Three Rules of Swap Space Allocation 

• Start at the swap device or file system with the lowest priority 
number. The lower the number, the higher the priority; that is, space is 
taken from a device or file system with priority 0 before it is taken 
from one with priority 1. 

• If multiple devices have the same priority, swap space is allocated 
from the devices in a round-robin fashion. Thus, to interleave swap 
requests between a number of devices, the devices should be assigned 
the same priority. Similarly, if multiple file systems have the same 
priority, requests for swap are interleaved between the file systems. 

In the figure, swap requests are initially interleaved between the two 
swap devices at priority 0. 

• If a device and a file system have the same swap priority, all the swap 
space from the device is allocated before any file-system swap space. 
Thus, the device at priority 1 will be filled before swap is allocated 
from the file system at priority 1. 
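
The three rules can be combined into one small model. The classes 
SwapArea and SwapChooser below are hypothetical and allocate a single 
page per call; the kernel works with its own tables and larger units, so 
treat this only as an illustration of the ordering. 

```python
class SwapArea:
    def __init__(self, name, priority, free_pages, is_device):
        self.name = name
        self.priority = priority      # 0 (used first) through 10
        self.free_pages = free_pages
        self.is_device = is_device    # True: device swap, False: file-system swap

class SwapChooser:
    def __init__(self, areas):
        self.areas = list(areas)
        self.cursor = {}              # (priority, is_device) -> index of next area to try

    def _round_robin(self, priority, is_device):
        group = [a for a in self.areas
                 if a.priority == priority and a.is_device == is_device]
        start = self.cursor.get((priority, is_device), 0)
        for step in range(len(group)):
            idx = (start + step) % len(group)
            if group[idx].free_pages > 0:
                # Rule 2: the next request starts with the following area.
                self.cursor[(priority, is_device)] = (idx + 1) % len(group)
                return group[idx]
        return None

    def allocate_page(self):
        for priority in range(11):                # Rule 1: lowest number first
            for is_device in (True, False):       # Rule 3: devices before file systems
                area = self._round_robin(priority, is_device)
                if area is not None:
                    area.free_pages -= 1
                    return area.name
        return None                               # no swap space left

chooser = SwapChooser([
    SwapArea("disk0", 0, 2, True),
    SwapArea("disk1", 0, 2, True),
    SwapArea("fs1", 1, 1, False),
    SwapArea("disk2", 1, 1, True),
])
print([chooser.allocate_page() for _ in range(7)])
```

The two priority-0 devices are used alternately until both are full, then 
the priority-1 device, and only then the priority-1 file system. 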

Figure 1-29 Choosing a swap location 

(Figure shows the swdev_pri, swdevt, and swaptab structures.) 



Swap Space Structures 

Swapping is accomplished on HP-UX using the following data structures: 

• Device swap priority array (swdev_pri[]), used to link together 
swap devices with the same priority. That is, the entry in 
swdev_pri[n] is the head of a list of swap devices having priority n. 
The first field in the swdev_pri[] structure is the head of the list; the 
sw_next field in the swdevt structure links each device into the 
appropriate priority list. 

• File-system swap priority array (swfs_pri[]), which serves the 
same purpose as swdev_pri[], but for file-system swap priority. 

• Device swap table (struct swdevt), defined in conf.h to establish 
the fundamental swap device information. 

• File-system swap table (struct fswdevt), defined in swap.h for 
supplementary file-system swap. 

• Swap table of available chunks (struct swaptab), which keeps 
track of the available free pages of swap space. 

• Mapping of swap pages (struct swapmap), whose entries together 
with swaptab combine for a swap disk block descriptor. 

The following table details the elements of struct swdevt. 

Table 1-26 Device swap table (struct swdevt) 


Element           Meaning 

sw_dev            Actual swap device, as defined by its major (upper 
                  8 bits) and minor (lower 24 bits) numbers. 

sw_enable         Enabled flag. Zero if device swap is disabled; one 
                  if enabled. 

sw_start          Offset into the swap area on disk, in kilobytes. 

sw_nblksavail     Size of swap area, in kilobytes. 

sw_nblksenabled   Number of blocks enabled for swap. Must be a 
                  multiple of swchunk (2 MB default). 

sw_nfpgs          Number of free swap pages on the device. Updated 
                  whenever a page is used or freed. 

sw_priority       Priority of swap device (0-10). 

sw_head,          First and last swaptab[] entry associated with swap 
sw_tail           device. 

sw_next           Pointer to the next device swap entry (swdevt) at 
                  this priority; implemented as a circular list used 
                  to update the pointer in swdev_pri for round-robin 
                  use of all devices at a particular priority. 

The following table details the principal elements of struct fswdevt. 

Table 1-27 File system swap table (struct fswdevt) 


Element           Meaning 

fsw_next          Pointer to next file system swap (fswdevt entry) at 
                  this priority; implemented as a circular list. 

fsw_enable        Enabled flag. Zero if file-system swap is disabled; 
                  one if enabled. 

fsw_nfpgs         Number of free swap pages in this file system swap; 
                  updated whenever a page is used or freed. 

fsw_allocated     Number of swchunks (2 MB default) allocated on this 
                  file-system swap. 

fsw_min           Minimum swchunks to be preallocated when the 
                  file-system swap is enabled. 

fsw_limit         Maximum swchunks allowed on file system; unlimited 
                  if set to zero. 

fsw_reserve       Minimum blocks (of size fsw_bsize) reserved for 
                  non-swap use on this file system. 

fsw_priority      Priority of device (0-10). Priority can also be 
                  determined by identifying the swfs_pri[] linked 
                  list. 

fsw_vnode         vnode of the file system swap directory (/paging) 
                  under which the swap files are created. 

fsw_bsize         Block size used on this file system; used to 
                  determine how much space fsw_reserve is reserving. 

fsw_head,         Index into swaptab[] of first, last entry associated 
fsw_tail          with this file system swap. 

fsw_mntpoint      File system mount point; character representation 
                  of fsw_vnode, used for utilities (such as 
                  swapinfo(1M)) and error messages. 


swaptab and swapmap Structures 

Two structures track swap space. Each entry in the swaptab[] array tracks 
a chunk of swap space; swapmap entries hold swap information at a 
per-page level. By default, a swaptab entry tracks a 2 MB chunk of space 
and its swapmap tracks each page within that 2 MB chunk. 

Each entry in the swaptab[] array has a pointer (called st_swpmp) to a 
unique swapmap. swapmap entries have backward pointers to the 
swaptab index. There is one entry in the swapmap for each page 
represented by the swaptab entry (default 2 MB, or 512 pages); that is, 
swapmap conforms in size to swchunk. 

A linked list of free swap pages begins at the swaptab entry's st_free 
and uses each free swapmap entry's sm_next. When a page of swap is 
needed, the kernel walks the structures using the getswap() routine 
in vm_swalloc.c, which calls other routines that actually locate the 
chunk, and so forth. 

• Beginning with the lowest priority, we examine 
swdev_pri[].curr, which points to a swdevt entry. 

• If sw_nfpgs is zero (no free pages), we follow the pointer sw_next to 
get the next swdevt entry at this priority. 

• If none of these have free pages, we move on to swfs_pri[].curr, 
the file-system swap at this priority, checking fsw_nfpgs for free 
pages. 

• If we are still unsuccessful, we move to the next priority and try 
again. 

• Once we find a swdevt or fswdevt with free pages, we walk that 
device's swaptab list, starting with sw_head or fsw_head, and using 
st_next in each swaptab entry, until we find a swaptab entry with 
non-zero st_nfpgs. 

• st_free points to the first free swapmap entry (and thus first free 
page) in this swaptab chunk. 

• The swalloc() routine creates a disk block descriptor (dbd) using 14 
bits of dbd_data for the swaptab index and 14 bits for the swapmap 
index. The r_bstore in the region is set to the disk device vnode or 
the file system directory vnode, and the dbd is marked dbd_bstore. 

When faulting in from swap, the same process is followed as for 
faulting in from the file system: r_bstore and dbd_data are hashed 
together and checked for a soft fault, then devswap_pagein() is 
called. The devswap_pagein() routine uses the dbd_data as a 
14-bit swaptab index and a 14-bit swapmap index to determine the 
location of the page on disk. 

Now all information needed to retrieve the page from swap has been 
stored. 
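
The packing of the two 14-bit indexes into dbd_data can be sketched as 
below. The text gives only the field widths, so the bit order used here 
(swaptab index in the upper 14 bits, swapmap index in the lower 14) and 
the function names are assumptions for illustration. 

```python
INDEX_BITS = 14
INDEX_MASK = (1 << INDEX_BITS) - 1   # largest value a 14-bit index can hold

def pack_dbd_data(swaptab_index, swapmap_index):
    """Combine a swaptab index and a swapmap index into one 28-bit value."""
    if not (0 <= swaptab_index <= INDEX_MASK and 0 <= swapmap_index <= INDEX_MASK):
        raise ValueError("index does not fit in 14 bits")
    return (swaptab_index << INDEX_BITS) | swapmap_index

def unpack_dbd_data(dbd_data):
    """Recover (swaptab index, swapmap index) from a packed value."""
    return dbd_data >> INDEX_BITS, dbd_data & INDEX_MASK

# Page 1 of the chunk described by swaptab[6].
packed = pack_dbd_data(6, 1)
print(packed, unpack_dbd_data(packed))
```

With 14 bits per index, this layout can address up to 16384 chunks of up 
to 16384 pages each, which comfortably covers the default 512 pages per 
chunk. 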

Figure 1-30 The swaptab and swapmap structures 



Table 1-28 Swap table entry (struct swaptab) 

Element           Meaning 

st_free           Index to the first free page in the chunk. Each entry 
                  maps to a 4 KB page of swap. 

st_next           Index to next swaptab entry for same device or 
                  file-system swap; at end of list, st_next is -1. 

st_flags          st_indel: File-system swap flag, indicating chunk is 
                  being deleted; do not allocate pages from it. Set 
                  only by the realswapoff() routine. 
                  st_free: File-system swap flag, indicating chunk may 
                  be deleted, because none of its pages are in use. In 
                  the case of remote swap, the chunk should not be 
                  deleted immediately; set st_free_time to current 
                  time plus 30 minutes (1800 seconds) when setting 
                  this flag. Once 30 minutes has elapsed, the chunk 
                  can be freed. If the chunk is needed during the 
                  interim, the flag can be cleared using 
                  chunk_release(), called from lsync(). 
                  st_inuse: swaptab entry is being changed. 

st_dev,           Pointers to swdevt entry that references the swaptab 
st_fsp            entry. 

st_nfpgs          Number of free pages in this (swchunk) swaptab 
                  entry. 

st_swpmp          Pointer to swapmap[] array that defines this swchunk 
                  of swap pages. 

st_free_time      Indicates when remote fs chunk can be freed (see 
                  explanation of st_free flag). 

Table 1-29 Swap map entry (struct swapmap) 

Element           Meaning 

sm_ucnt           Number of threads using the page. When decremented 
                  to zero, the swap page is free and the free pages 
                  linked list can be updated. 

sm_next           Index of the next free page in the swapmap[]. This 
                  is valid only if sm_ucnt is zero; that means that 
                  this swapmap entry is included in the linked list 
                  beginning with swaptab's st_free. 
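
The free-page list that runs from st_free through the sm_next fields can 
be modeled with two parallel arrays. The class SwapChunk below is a 
hypothetical stand-in for one swaptab entry and its swapmap; it assumes 
that -1 terminates the free list (the text states this only for 
st_next) and that a freed page is pushed on the front of the list. 

```python
class SwapChunk:
    """One swaptab entry plus its swapmap, tracking `pages` swap pages."""

    def __init__(self, pages=512):
        self.sm_ucnt = [0] * pages                   # use count per swap page
        self.sm_next = list(range(1, pages)) + [-1]  # free list links, -1 ends it
        self.st_free = 0                             # index of first free page
        self.st_nfpgs = pages                        # number of free pages

    def alloc_page(self):
        """Take the first free page; return its swapmap index or -1 if full."""
        if self.st_nfpgs == 0:
            return -1
        index = self.st_free
        self.st_free = self.sm_next[index]           # unlink from the free list
        self.sm_ucnt[index] = 1
        self.st_nfpgs -= 1
        return index

    def release_page(self, index):
        """Drop one user; relink the page once nobody uses it."""
        self.sm_ucnt[index] -= 1
        if self.sm_ucnt[index] == 0:
            self.sm_next[index] = self.st_free       # push on the free list
            self.st_free = index
            self.st_nfpgs += 1

chunk = SwapChunk(pages=4)
first, second = chunk.alloc_page(), chunk.alloc_page()
chunk.release_page(first)
print(first, second, chunk.alloc_page(), chunk.st_nfpgs)
```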


Deactivation using the pager 

Since vhand() is tuned to be nice regarding I/O usage and CPU usage, the 
pager is used to page out swapped processes. The swapper marks the 
process to be swapped for deactivation, which takes it off the run queue. 
Since the process cannot run, its pages cannot be referenced again once 
they have been aged. When the steal hand comes around, it steals all the 
pages in the region. 

When memory pressure is high, sched() selects a process to swap using 
the routine choose_deactivate(). This routine is biased to choose 
non-interactive processes over interactive ones, sleeping processes over 
running ones, and long-running processes over newer ones. 
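
One way to picture this bias is as an ordering on candidate processes. 
The sketch below is not the kernel's algorithm: the text says only that 
the routine is "biased", so the strict lexicographic ordering, the Proc 
fields, and the run_time measure are all assumptions made for 
illustration. 

```python
from dataclasses import dataclass

@dataclass
class Proc:
    pid: int
    interactive: bool
    sleeping: bool
    run_time: int   # hypothetical: seconds the process has been running

def choose_deactivate(procs):
    """Pick the process that best matches the stated preferences."""
    if not procs:
        return None
    # Tuples compare element by element: non-interactive beats interactive,
    # then sleeping beats running, then longer-running beats newer.
    return max(procs, key=lambda p: (not p.interactive, p.sleeping, p.run_time))

candidates = [
    Proc(pid=1, interactive=True, sleeping=True, run_time=1000),
    Proc(pid=2, interactive=False, sleeping=False, run_time=10),
    Proc(pid=3, interactive=False, sleeping=True, run_time=5),
    Proc(pid=4, interactive=False, sleeping=True, run_time=50),
]
print(choose_deactivate(candidates).pid)
```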

Once a process has been chosen to be deactivated, the following actions 
occur: 

• The process's sdeact flag and its threads' tsdeact flags are set. 

• The process's threads are removed from the run queue. If the process 
is waiting for I/O, its sdeactself flag and its threads' tsdeactself 
flags are set. When I/O completes, the process deactivates in the 
paging routines. 

• The process's p_deactime in the proc structure is set to the 
current time to establish a record of how long the process is 
deactivated. 

• The process is positioned in the active pregion chain to ready it for 
the steal hand. 

• The uarea pregion is added to the list of active regions for it to get 
paged out. 

• The global counter deactive_cnt is incremented. 

A process that has been inactive long enough for all its pages to have 
been aged and stolen is virtually swapped out already. The global 
deactprocs points to the head of a list of inactive processes, its chain 
running through the pregion element p_nextdeact. If the average 
number of free pages drops below lotsfree, these pages are swapped 
out. 

When memory pressure eases, a deactivated process is reactivated. The 
choose_reactivate() routine is biased to choose interactive processes 
over non-interactive ones, runnable processes over sleeping ones, 
and processes that have been deactivated longest over those more 
recently deactivated. 

Overview of Demand Paging 

Recall that for a process to execute, all the regions (for data, text, and so 
forth) have to be set up; yet pages are not loaded into memory until the 
process demands them. Only when the actual page is accessed is a 
translation established. 

A compiled program has a header containing information on the size of 
the data and code regions. As a process is created from the compiled code 
by fork and exec, the kernel sets up the process's data structures and the 
process starts executing its instructions from user mode. When the 
process tries to access an address that is not currently in main memory, a 
page fault occurs. (For example, you might attempt to execute from a 
page not in memory.) The kernel switches execution from user mode to 
kernel mode and tries to resolve the page fault by locating the pregion 
containing the sought-after virtual address. The kernel then uses the 
pregion's offset and region to locate information needed for reading in the 
page. 

If the translation is not already present and the page is required, the 
pdapage() routine executes to add the translation (space ID, offset into 
the page, protection ID and access permissions assigned to the page, and 
logical frame number of the page), and then on demand brings in that 
page, sets up the translation, hashes it into the table, and so on. 

In main memory, the kernel also looks for a free physical page in which 
to load the requested page. If no free page is available, the system swaps 
or pages out selected used pages to make room for the requested page. 
The kernel then retrieves (pages in) the required page from file space on 
disk. It also often pages in additional (adjacent) pages that the process 
might need. 

Then the kernel sets up the page’s permissions and protections, and exits 
back to user mode. The process executes the instruction again, this time 
finding the page and continuing to execute. 
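
The fault-and-retry sequence above can be reduced to a toy model. The 
class DemandPagedProcess is hypothetical: it stands in for the kernel's 
page-in path with a dictionary lookup, and its readahead parameter 
loosely imitates the paging in of adjacent pages mentioned in the text. 

```python
class DemandPagedProcess:
    """Toy address space: pages stay on 'disk' until first touched."""

    def __init__(self, backing_store, readahead=0):
        self.backing_store = backing_store  # page number -> contents on disk
        self.resident = {}                  # page number -> contents in memory
        self.readahead = readahead          # extra adjacent pages loaded per fault
        self.faults = 0

    def access(self, page):
        if page not in self.resident:
            self.faults += 1                # page fault: handled "in the kernel"
            for p in range(page, page + 1 + self.readahead):
                if p in self.backing_store:
                    self.resident[p] = self.backing_store[p]
        return self.resident[page]          # the access is retried and succeeds

disk = {n: f"page-{n}" for n in range(8)}
proc = DemandPagedProcess(disk, readahead=1)
for n in (0, 1, 2, 0, 7):
    proc.access(n)
print(proc.faults, sorted(proc.resident))
```

Only pages that are actually touched (plus the read-ahead neighbors) 
become resident; the rest of the program never occupies memory. 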

The flexibility of demand paging lies in the fact that it allows a process to 
be larger than physical memory. Its disadvantage lies in the degree of 
complexity paging requires of the processor; instructions must be 
restartable to handle page faults. 

By default, all HP-UX processes are load-on-demand. A demand paged 
process does not preload a program before it is executed. The process 
code and data are stored on disk and loaded into physical memory on 
demand in page increments. (Programs often contain routines and code 
that are rarely accessed. For example, error handling routines might 
constitute a large percentage of a program and yet may never be 
accessed.) 

copy-on-write 

HP-UX now implements copy-on-write of exec_magic processes, to 
enable the system to manipulate processes more efficiently. The system 
used to copy the entire data segment of a process every time the process 
forked, increasing fork time as the size of the data and code segments 
increased. Now, only one translation of a physical page is maintained; a 
parent process can point to and read a physical page, but copies it only 
when writing on the page. The child process does not have a page 
translation and must copy the page for either read or write access. 

Copy-on-write means that pages in the parent's region are not copied to 
the child's region until needed. Both parent and child can read the pages 
without being concerned about sharing the same page. However, as soon 
as either parent or child writes to the page, a new copy is written, so that 
the other process retains the original view of the page. 

For more information about the implementation of exec_magic, see the 
HP-UX Process Management white paper. 
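
The general copy-on-write idea can be shown with reference-counted 
pages. This sketch models only the generic behavior (shared reads, a 
private copy on the first write); it does not model the single 
translation per physical page that makes the HP-UX child copy on any 
access, and the Page and Region classes are hypothetical. 

```python
class Page:
    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1                      # number of regions sharing this page

class Region:
    def __init__(self, pages):
        self.pages = pages

    def fork(self):
        """Give the child the same physical pages instead of copying them."""
        for page in self.pages:
            page.refs += 1
        return Region(list(self.pages))

    def read(self, index):
        return bytes(self.pages[index].data)

    def write(self, index, data):
        page = self.pages[index]
        if page.refs > 1:                  # shared: copy before writing
            page.refs -= 1
            page = Page(page.data)
            self.pages[index] = page
        page.data[:len(data)] = data

parent = Region([Page(b"aaaa")])
child = parent.fork()
child.write(0, b"bb")
print(parent.read(0), child.read(0))
```

After the write, the parent still sees the original contents while the 
child sees its own modified copy. 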


HOW PROCESS STRUCTURES ARE SET UP IN MEMORY 

When a process is forked, a duplicate copy of its parent process forms the 
basis of the child process. 

Region Type Dictates Complexity 

Under the kernel procdup() routine, the system walks the pregion list 
of the parent process, duplicating each pregion for the child process. 

How this is done is dictated by the region type. 

• If the region is type rt_shared, a new pregion is created that 
attaches to the parent’s region. 

• If the region is type rt_private, the region is duplicated first, and 
then a new pregion is created and attached to the new region. 

Duplicating pregions for Shared Regions 

Because a region of type rt_shared is shared by parent and child, fewer 
changes occur to the pregions and region: Only a new pregion must be 
created and attached to the shared region. 

• A new pregion is allocated and fields copied from the parent pregion 
to the child pregion. 

• The pregion elements used by vhand (p_agescan, p_ageremain,
and p_stealscan) are initialized to zero and the child pregion is
added to the active pregion chain just before the stealhand, to
prevent it from being stolen yet.

• The region elements r_incore and r_refcnt are incremented to 
reflect the number of in-core pregions accessing the region and the 
number of pregions, in-core or paged, accessing the region. 
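
The difference between the two region types can be sketched as
follows. This is a simplified Python model under stated assumptions:
the field names come from the text, while the attach() and
dup_pregion() helpers are hypothetical and the private case omits the
page-state copying described later.

```python
from dataclasses import dataclass


@dataclass
class Region:
    r_type: str        # "rt_shared" or "rt_private"
    r_refcnt: int = 0  # pregions attached, in-core or paged
    r_incore: int = 0  # in-core pregions attached


@dataclass
class Pregion:
    p_reg: Region
    p_agescan: int = 0     # vhand fields start at zero
    p_ageremain: int = 0
    p_stealscan: int = 0


def attach(region):
    """Create a pregion with zeroed vhand fields and attach it to region."""
    region.r_refcnt += 1
    region.r_incore += 1
    return Pregion(p_reg=region)


def dup_pregion(parent_preg):
    """Duplicate one parent pregion for the child, by region type."""
    region = parent_preg.p_reg
    if region.r_type == "rt_private":
        # Simplification: a real duplicate also copies the page state.
        region = Region(r_type="rt_private")
    return attach(region)


shared = Region("rt_shared")
parent_preg = attach(shared)
child_preg = dup_pregion(parent_preg)
print(shared.r_refcnt, shared.r_incore)
```

For a shared region the child pregion points at the parent's region
and only the counters change; for a private region the child gets a
region of its own.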

The procedure is considerably more complex when an rt_private
region is copied.

Figure 1-31 Duplicating pregions with shared regions

Duplicating pregions for Private Regions

Forking a process with a region of type rt_private requires that a new
child region be allocated first.

• The child region's pointers are set: 

• r_fstore, the forward store pointer, is pointed to the same value
as the parent's, and the vnode's reference count (v_count) is
incremented.

• r_bstore, the backward store pointer, is set to the kernel global
swapdev_vp, and its v_count is incremented also.

• The child region is attached to the end of the linked list of active 
regions. 

• Swap is reserved. If insufficient swap space is available, fork() fails
and returns the error ENOMEM.

• The child region's B-tree structures are initialized and sufficient
swap space is reserved for a completely filled B-tree.

Figure 1-32 Duplicating a child process of type rt_private

• The parent's vfd and dbd proto values are copied to the child's
B-tree root.

• The vfd proto in both the parent region and the child region are set
so that all pages of the region are copy-on-write.

• The B-tree element b_vproto is set to indicate that the
copy-on-write flag (pg_cw) must be set in the vfd for any new vfddbd
pair added to the B-tree.

• A chunk of vfddbds is created for the child's B-tree (equal to each
chunk of vfddbds in the parent's B-tree) and filled with proto
values. The pg_cw bit is already set to copy-on-write for all default
vfds in the child B-tree's chunk.

Setting copy-on-write when the vfd is valid 

Before the chunks of vfddbds in the child region can be used, the
validity of every entry must be checked.

• If a vfd is not valid (that is, its pg_v is not set), the pg_cw of the
parent's vfd must be set and copied to the child. If pg_lock is set in
the parent, it must be unset in the child, as locks are not inherited.

Once the vfd is valid, further modifications are made to the low-level
structures:

• The r_nvalid element in the child region is incremented to reflect
the number of valid pages.

• The vfd contains a pfn (page frame number), which indexes into the
pfdat array. The pfdat entry's pf_use count (number of regions
using this page) must be incremented.

• If the parent vfd's copy-on-write bit isn't set, the pde must be set for
translations to the page to behave as copy-on-write.

Reconciling the Page and Swap Image

If a page has been written to a swap device, but has since been modified,
the swap-device data now differs from the data in memory. The disk
page must be disassociated from the page in memory by setting the dbd
type to dbd_none. Then, the next time the page is written to a swap
device, it will be assigned a new location.

Everything is now set up from the perspective of the parent’s B-tree for 
copy-on-write. 
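
The per-entry bookkeeping described above can be sketched as follows.
This is a simplified Python model, not the kernel code: the Vfd
dataclass and the dup_vfds() helper are hypothetical, pf_use is
modeled as a dictionary keyed by page frame number, and the B-tree,
dbd, and pde handling are left out.

```python
from dataclasses import dataclass, replace


@dataclass
class Vfd:
    pg_v: bool = False     # page is valid (in memory)
    pg_cw: bool = False    # copy-on-write
    pg_lock: bool = False  # page is locked
    pfn: int = 0           # page frame number, meaningful when pg_v is set


def dup_vfds(parent_vfds, pf_use):
    """Return (child_vfds, child_r_nvalid); updates parent_vfds and pf_use."""
    child_vfds = []
    child_nvalid = 0
    for vfd in parent_vfds:
        vfd.pg_cw = True                      # parent becomes copy-on-write
        child = replace(vfd, pg_lock=False)   # locks are not inherited
        if child.pg_v:
            child_nvalid += 1                 # one more valid page in child
            pf_use[child.pfn] += 1            # one more region uses the frame
        child_vfds.append(child)
    return child_vfds, child_nvalid


parent = [Vfd(pg_v=True, pg_lock=True, pfn=7), Vfd()]
pf_use = {7: 1}
child, r_nvalid = dup_vfds(parent, pf_use)
print(r_nvalid, pf_use[7], child[0])
```

After the call, both copies of every entry carry the copy-on-write
bit, the child holds no locks, and the shared frame is counted twice.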

Setting the child region's copy-on-write status 

• The child's r_swalloc is set to the number of region and B-tree
pages reserved.

• The r_prev and r_next are set to link the child region to the parent
region.

• The kernel chooses new space for the pregion, rather than copying it
from the parent pregion. This establishes two ranges of virtual
addresses (different space, same offset) translating to the single
range of physical addresses.

• If a parent process accesses its virtual addresses, it will get a TLB
miss fault because the addresses have been purged from the TLB.

• If a child process accesses any of its virtual addresses, it will also
get a TLB miss fault because the addresses did not previously
exist in the TLB, and do not exist in htbl.

Duplicating a Process Address Space to Make
the Process copy-on-write

• procdup() creates a duplicate copy of a process based on forktype,
parent process (pp), child process (cp), parent thread (pt), and
child thread (ct).

• procdup() allocates memory for the uarea of the child. (In fact,
procdup() is the routine that calls createu() to create the uarea
too.)

• procdup() calls dupvas to duplicate the parent's virtual address
space, based on the kind of process (fork vs vfork) being executed.

• If the process was created by fork, dupvas duplicates the parent
process's virtual address space; if the process was vfork'd, the
parent's virtual address space is used.

• dupvas looks for and finds each private data object, does whatever
each requires to be duplicated (there are special considerations
required for text, memory mapping, data objects, graphics), and when
it finishes duplicating the special objects, calls private_copy or
shared_copy, depending on whether it is dealing with a private or
shared region.

• If the region is shared, shared_copy increments the reference
count on the region to indicate it is being shared.

• If the region is private, private_copy locks the region and
enables the region to be duplicated by calling dupreg().

• dupreg() allocates a new region for the child, duplicates the parent's
vfds and the entire region structure, then calls do_dupc to duplicate
entries under the region.

• do_dupc() sets up a parent-child relationship, and by duplicating
the relationship, sets up the child to be copy-on-write. It makes
sure the parent's region is valid, sets copy-on-write for the child, sets
the translation as rx (read-execute) only, and duplicates information
for every vfddbd combination in the region.

• Once do_dupc() completes, the child process exists as a duplicated
version of the parent process. The child process is attached to the
child's address space and is no longer dependent on the parent.

• do_dupc then calls hdl_cw() to update the child's access rights and
make the child copy-on-write.
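
The call sequence above can be summarized in a short sketch. The
function names mirror the kernel routines named in the text, but the
bodies are hypothetical Python stand-ins: regions are plain
dictionaries, and the "trace" list exists only to record the order of
calls.

```python
def procdup(forktype, parent_regions, trace):
    trace.append("procdup")
    return dupvas(forktype, parent_regions, trace)


def dupvas(forktype, parent_regions, trace):
    trace.append("dupvas")
    if forktype == "vfork":
        # A vfork'd child uses the parent's address space as is.
        return parent_regions
    child_regions = []
    for region in parent_regions:
        if region["type"] == "rt_shared":
            child_regions.append(shared_copy(region, trace))
        else:
            child_regions.append(private_copy(region, trace))
    return child_regions


def shared_copy(region, trace):
    trace.append("shared_copy")
    region["refcnt"] += 1            # the same region is now shared
    return region


def private_copy(region, trace):
    trace.append("private_copy")
    return dupreg(region, trace)


def dupreg(region, trace):
    trace.append("dupreg")
    child = dict(region, refcnt=1)   # new region copied from the parent
    do_dupc(region, child, trace)
    return child


def do_dupc(parent, child, trace):
    trace.append("do_dupc")
    parent["cow"] = child["cow"] = True
    hdl_cw(child, trace)


def hdl_cw(region, trace):
    trace.append("hdl_cw")
    region["access"] = "rx"          # read-execute only until a write fault


trace = []
text = {"type": "rt_shared", "refcnt": 1}
data = {"type": "rt_private", "refcnt": 1}
child = procdup("fork", [text, data], trace)
print(" -> ".join(trace))
```

With forktype "vfork" the same entry point returns the parent's
region list unchanged, which reflects the borrowing of the parent's
address space until exec().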

Duplicating the uarea for the Child's Process

The createu() routine builds a uarea and address space for the child
process. The uarea is set up last for a fork'd process, to prevent the
child process from resuming in the middle of pregion duplication code.
If the process is vfork'd, the uarea is created during exec(). Until
then, the child uses the parent's uarea.

• When a user process is created with fork_process, a temporary
space is allocated for a working copy of the parent's uarea to be
modified into the child's uarea. The temporary space will be freed
after the uarea is copied to the new region. fork() updates the
savestate in the parent uarea's u_pcb just before copying the data.
(vfork() does not do this because it creates the uarea during
exec(), and the savestate will change immediately.)

• A region is allocated for the new uarea, its data structure is 
initialized, its r_bstore value set back to the swap device, and the 
new region is added to the list of active regions. The uarea has no 
r_fstore value, since it comes with ready-made data.

• Space is allocated for the uarea's pregion, which is initialized. 

Each uarea has a unique space ID. The new pregion is marked with
the PF_NOPAGE flag; uarea pregions are unaffected by vhand
because they are not added to the list of active pregions. Only if an
entire process is swapped out are the uarea's pages written to a swap
device.

• Once created, the pregion is attached into the linked list of
pregions connected to the vas. Its pointer is stored in r_pregs, its
p_prpnext set to NULL, and its r_incore and r_refcnt set to one.

• Once swap space is reserved for the uarea and B-tree pages and
the default dbd is set to dbd_dfill, the uarea pages (upages) are
allocated. Each page requires a pfdat entry from phead (sleeping if
none is available immediately). The pfn is stored in the vfd, the
pg_v is set as valid, r_nvalid is incremented, and a pde is created
for the physical-to-virtual translation. The pfdat entry's p_uarea
and hdlpf_trans flags are set, and the dbd is set to dbd_none.

• The pointers u_procp (to the child process) and u_kthreadp (to the 
child thread) are pointed to the child uarea. 

Conceivably, the child can now run successfully. The current state is
therefore saved in the copied uarea with a setjmp() call and pointed
to with pcb_sswap. Thus, when the child first calls the resume()
routine, it detects that pcb_sswap is non-zero and does a longjmp() to
get back here. The child then returns from procdup() with the value
FORKRTN_CHILD.

The parent’s open file table is copied to the child and the copied uarea is 
copied into the actual pregion. This copy causes TLB miss faults that 
cause the pregion's pdes to be written to the TLB, thus associating the 
uarea's virtual address with the physical pages just set up. The process 
completes by returning from procdup() with the return value
FORKRTN_PARENT.

Reading from the parent's copy-on-write 
page 

When the parent region accesses one of its rt_private pages for read,
the processor generates a TLB miss fault, which the kernel handles as an
interrupt. The TLB miss fault handler finds the pde and inserts the
information (including the new access rights) into the processor's TLB.
On return from the interrupt, the processor retries the read and is
successful, since PDE_AR_CW allows user-mode read and execute access.

Figure 1-33 The first time a read is done to a copy-on-write page

Reading from the child's copy-on-write page

When the child region accesses one of its pages for read, the TLB miss
handler does not find a pde for the virtual address, because none has
been set up yet. The virtual address was set up in the pregion
structure. If you are not doing copy-on-access (which is now the default)
and the page is needed, the aliased translation must be made.

• First a save state is created. 

• The vas pointer is taken and the skip list searched to find the 
pregion containing the page with this address. 

• If the page translates to more than one virtual address, the 
appropriate alias is acquired. 

• The child region fails to access a page for read and gets a TLB miss, 
but the miss handler finds a translation and loads it into the TLB. 

• The routine returns from interrupt and succeeds in reading the page. 
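
The miss-and-retry behavior for both parent and child reads can be
modeled as a cache in front of the table of pdes. This is a minimal
sketch, assuming the aliased translation for the child has already
been created; the Mmu class, the space IDs, and the frame number are
hypothetical example values.

```python
class Mmu:
    """Toy model: the TLB is a cache in front of the table of pdes."""

    def __init__(self, pdes):
        self.pdes = pdes   # (space, offset) -> (pfn, access rights)
        self.tlb = {}
        self.misses = 0

    def read(self, space, offset):
        vaddr = (space, offset)
        while vaddr not in self.tlb:
            # TLB miss fault: the handler looks up the pde and inserts
            # it into the TLB, then the access is retried.
            self.misses += 1
            if vaddr not in self.pdes:
                raise LookupError("no translation for %r" % (vaddr,))
            self.tlb[vaddr] = self.pdes[vaddr]
        pfn, rights = self.tlb[vaddr]
        if "r" not in rights:
            raise PermissionError("read not permitted")
        return pfn


PARENT_SPACE, CHILD_SPACE = 0x10, 0x20   # arbitrary example space IDs
mmu = Mmu({
    (PARENT_SPACE, 0x4000): (99, "rx"),
    (CHILD_SPACE, 0x4000): (99, "rx"),    # alias of the same frame
})
print(mmu.read(PARENT_SPACE, 0x4000), mmu.read(CHILD_SPACE, 0x4000), mmu.misses)
```

Each address space takes one miss on its first read of the page;
later reads are served from the TLB, and both addresses resolve to
the same physical frame.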

Faulting In A Page 

When regions are initialized, the dbd_data field of the disk block
descriptor (dbd) is set to dbd_dinval (0xfffffff) in all cases. The
prototype dbd_type values are set as follows:

• dbd_fstore for text and initialized data, 

• dbd_dzero for stack and uninitialized data. 

When a page is read for the first time, a TLB miss fault results because 
the physical page (and therefore its translation in the sparse PDIR) does 
not yet exist. The fault handler is responsible for bringing in the page 
and restarting the instruction that faulted. In determining whether or
not the page is valid, the fault handler determines which pregion in the
faulting process contains the faulting address. The fault code eventually
calls virtual_fault(), the primary virtual-fault handling routine.
The arguments passed to this routine are the virtual address causing the
fault, the virtual address and virtual space of the pregion, and a flag
indicating read or write access.

The kernel searches the B-tree for the vfd and dbd of the page. If the
valid bit in the vfd flag is set, another process has read the address into
memory already. If the r_zomb flag is set in the region, the program
prints the message "Pid %d killed due to text modification or page I/O
error" and returns SIGKILL, which the handler sends to the
process.

Faulting In a Page of Stack or Uninitialized Data 

If the dbd_type value is set to dbd_dzero (as is the case for stack and 
uninitialized data), the process sets the copy-on-write bit to zero. The 
kernel then checks to determine whether the page pertains to a system 
process or to a high-priority thread. If neither and memory is tight, the 
process sleeps until free memory is driven down to the priority 
associated with the process. (In the worst case, a thread might wait until
memory is above desfree.)

Once the process is restarted, vfd and dbd pointers are examined to
ensure their continued accuracy. A free pfdat entry is acquired from
phead, its pfn (pf_pfn) placed in the vfd, the vfd's valid bit set, and
the region's r_nvalid counter (number of valid pages) incremented.
The process changes dbd_type to dbd_none and dbd_data to
0xfffff0c. Finally, the virtual-to-physical translation of the page is
added to the sparse PDIR and the page is zeroed.
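
The zero-fill sequence can be sketched as follows. This is a
simplified Python model under stated assumptions: the structures are
plain dictionaries and lists, PAGE_SIZE is an assumed value, the
function name fault_in_zero_page() is hypothetical, and the wait for
free memory is omitted.

```python
PAGE_SIZE = 4096            # assumed page size for the example
DBD_NONE_DATA = 0xfffff0c   # dbd_data value quoted in the text above


def fault_in_zero_page(vfd, dbd, region, phead, memory, pdir, vaddr):
    """Satisfy a fault on a dbd_dzero page with a zeroed free frame."""
    if dbd["dbd_type"] != "dbd_dzero":
        raise ValueError("not a zero-fill page")
    pfn = phead.pop(0)                   # acquire a free frame from phead
    vfd["pfn"] = pfn                     # record the frame in the vfd
    vfd["pg_v"] = True                   # mark the vfd valid
    region["r_nvalid"] += 1              # one more valid page in the region
    dbd["dbd_type"] = "dbd_none"
    dbd["dbd_data"] = DBD_NONE_DATA
    pdir[vaddr] = pfn                    # virtual-to-physical translation
    memory[pfn] = bytearray(PAGE_SIZE)   # zero the page
    return pfn


vfd = {"pfn": None, "pg_v": False}
dbd = {"dbd_type": "dbd_dzero", "dbd_data": 0xfffffff}
region = {"r_nvalid": 0}
phead, memory, pdir = [5, 6], {}, {}
pfn = fault_in_zero_page(vfd, dbd, region, phead, memory, pdir,
                         vaddr=(0x20, 0x7000))
print(pfn, vfd, dbd["dbd_type"])
```

No disk I/O is involved in this path: the page contents are produced
entirely by zeroing the newly acquired frame.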

Figure 1-34 Checking the free list to fault in a dbd_fstore page

Faulting in a Page of Text or Initialized Data

If a process has a virtual fault on a dbd_fstore page, the kernel uses
the r_fstore pointer of the region's vnode to determine which
file-system specific pagein() routine (for example, ufs_pagein(),
nfs_pagein(), cdfs_pagein(), vx_pagein()) to call. The pagein()
routines are used to recover the correct page from a free list of memory
pages or to read in a correct page from disk.

The pagein routine gets information about the page being faulted from
the vm_pagein_init() routine, which gets the vfd/dbd pairs, sets
up the region index, and ascertains that no valid page already exists.

One page must be reserved. Then vm_no_io_required() is called to
determine if the page can be satisfied locally, either by a zero-filled page
(sparse file) or from the page cache.

vm_no_io_required() checks for the faulted page in the page cache:

• vm_no_io_required acquires the device vnode pointer (devvp) that
points to the actual disk device (such as /dev/vg00/lvol5) rather than
to the file referenced by r_fstore.

• If the dbd data field is dbd_dinval, vm_no_io_required gets the
actual location of the disk block on the disk device and stores this
value in the dbd data field.

• vm_no_io_required calls pageincache() with the device vnode
pointer and the dbd_data to determine whether the faulted page is
on the hash list.

• The pageincache() routine hashes on the vnode pointer and data
to choose a pfdat pointer in phash[]. The routine walks the
pf_hchain chain of pfdat entries looking for a matching vnode
pointer (pf_devvp) and data value (pf_data). If it finds a match, it
removes it from the free list.

• If pageincache() returns a pfdat entry, the region's valid page
count (r_nvalid) is incremented, the vfd is updated with the pfn
(pf_pfn), and a virtual-to-physical translation for the page is
added to the sparse PDIR (if it had been removed).

On successfully finding the page in the free list, vm_no_io_required()
returns a 1, meaning that no I/O is required to retrieve the page. This is
called a soft page fault.

If vm_no_io_required() cannot find the page locally, it returns 0,
meaning the page must be faulted in from disk.
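
The soft-fault lookup can be sketched as a hash table of chains keyed
by device vnode and block address. This is a simplified Python model,
not the kernel implementation: the PageCache class, its release()
method, the bucket count, and the dictionary-based vfd and region are
hypothetical, and the PDIR update is omitted.

```python
class PageCache:
    """Toy page cache: hash chains of (devvp, data, pfn) entries."""

    def __init__(self, nbuckets=16):
        self.phash = [[] for _ in range(nbuckets)]  # hash chains
        self.free = set()                           # pfns on the free list

    def _chain(self, devvp, data):
        return self.phash[hash((devvp, data)) % len(self.phash)]

    def release(self, devvp, data, pfn):
        """Put a page on the free list but keep its identity hashed."""
        self._chain(devvp, data).append((devvp, data, pfn))
        self.free.add(pfn)

    def pageincache(self, devvp, data):
        # Walk the chain looking for a matching device vnode and data.
        for pf_devvp, pf_data, pfn in self._chain(devvp, data):
            if pf_devvp == devvp and pf_data == data:
                self.free.discard(pfn)   # reclaimed: no longer free
                return pfn
        return None


def vm_no_io_required(cache, devvp, dbd_data, vfd, region):
    """Return 1 for a soft fault (page found), 0 if disk I/O is needed."""
    pfn = cache.pageincache(devvp, dbd_data)
    if pfn is None:
        return 0
    region["r_nvalid"] += 1
    vfd["pfn"] = pfn
    return 1


cache = PageCache()
cache.release("/dev/vg00/lvol5", 1200, pfn=42)
vfd, region = {"pfn": None}, {"r_nvalid": 0}
soft = vm_no_io_required(cache, "/dev/vg00/lvol5", 1200, vfd, region)
hard = vm_no_io_required(cache, "/dev/vg00/lvol5", 1201, {"pfn": None}, region)
print(soft, hard)
```

A page that was freed but not yet reused keeps its place in the hash,
which is what allows it to be reclaimed without going back to disk.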

Retrieving the Page of Text or Initialized Data from 
Disk 

If the required page is not found in the free list, the pagein() routines
refer to dbd to ascertain which page to fetch. (The information had been
stored in the dbd by vm_no_io_required().) The pagein() routines
also schedule read-ahead pages for I/O, the number of read-ahead pages
based on the value of p_pagein in the pregion. This value is adjusted
based on whether the file is being accessed at random or sequentially.

Figure 1-35 dbd_fstore fault of data not in the free list 



If it is being accessed at random, a minimal number of read-ahead pages
are required; if sequentially, a maximal number of read-ahead pages are
desired, up to the end of the pregion's pages.

• Each time I/O is scheduled for the pregion, the p_nextfault field in
the pregion structure is set to the page expected to be read next if
further reading is required.

• If the next page fault matches p_nextfault, the file is being
accessed sequentially. In this case, the value of p_pagein is
multiplied by two, up to maxpagein_size, a global set to 64. If
p_strength is less than 100 (defined as purely_sequential), it is
also incremented.

• If the next fault does not match p_nextfault, the file is being
accessed at random. In this case, the value of p_pagein is divided by
two, down to no less than minpagein_size, a global set to 1. If
p_strength is greater than -100 (defined as purely_random), it is
also decremented.

A page of memory is allocated from phead, a virtual-to-physical
translation added to the sparse PDIR, the I/O scheduled from the disk to
the page, and the process put to sleep awaiting the non-read-ahead I/O to
complete (the process does not await read-ahead I/O to complete). The
vfd is marked valid. The dbd is left with dbd_type set to dbd_fstore
and dbd_data set to the block address on the disk.

Regardless of whether the page data is retrieved from zero-fill, free list,
or disk, the page directory entry (pde) has been touched. The instruction
is retried and gets a TLB miss fault; the miss handler writes the 
modified pde data into the TLB; the instruction is retried again and 
succeeds. 

p_strength varies between -100 and 100; p_pagein varies by powers 
of two between 1 and 64. 
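
The read-ahead adjustment rules in this section can be sketched as
follows. The limits (1, 64, -100, 100) come from the text; the
Pregion class, the note_fault() function, and the way the next
expected fault is computed (the faulting page plus p_pagein) are
assumptions made for illustration.

```python
MAXPAGEIN_SIZE = 64
MINPAGEIN_SIZE = 1
PURELY_SEQUENTIAL = 100
PURELY_RANDOM = -100


class Pregion:
    def __init__(self):
        self.p_pagein = MINPAGEIN_SIZE
        self.p_strength = 0
        self.p_nextfault = None


def note_fault(preg, page):
    """Update the read-ahead state for a fault on `page`; return p_pagein."""
    if page == preg.p_nextfault:
        # Sequential access: double the read-ahead, up to the maximum.
        preg.p_pagein = min(preg.p_pagein * 2, MAXPAGEIN_SIZE)
        if preg.p_strength < PURELY_SEQUENTIAL:
            preg.p_strength += 1
    else:
        # Random access: halve the read-ahead, down to the minimum.
        preg.p_pagein = max(preg.p_pagein // 2, MINPAGEIN_SIZE)
        if preg.p_strength > PURELY_RANDOM:
            preg.p_strength -= 1
    # Assumption: the I/O covers p_pagein pages starting at `page`.
    preg.p_nextfault = page + preg.p_pagein
    return preg.p_pagein


preg = Pregion()
sizes = [note_fault(preg, page) for page in (0, 1, 3, 7, 15, 31, 63, 127)]
print(sizes, preg.p_strength)
```

A run of faults that each land on the predicted page grows the
read-ahead size geometrically until it reaches the cap, while a single
unpredicted fault halves it again.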

VIRTUAL MEMORY AND exec()

When the system performs an exec(), the virtual memory system
concerns itself with cleaning up old pregions/regions and setting up 
new ones. 

Cleaning up from a vfork()

Cleanup in the vfork() case is simple.

• The child process is executing but borrowing its resources from the
parent process.

• The routine creates its own uarea and returns the parent's resources.

• Then the routine adds text, data, and so on.

• The routine gets a new vas and attaches it to the child process
(p_vas).

• The uarea and stack of the parent process are copied and the
pregions and regions are created for the child uarea, just as for
a fork_process fork type.

• The uarea is copied into the child's uarea region, which is
pointed to the now-complete uarea from the thread, and the
thread switches from using the parent's kernel stack to the new
child kernel stack.

Disposing of the old pregions: dispreg()

If exec() is called after a fork_process fork, several regions must
be disposed of first. Typically, all pregions are disposed of except for the
pt_uarea pregion, which is still needed. If the file is calling exec() on
itself, we save a little processing and keep the pt_text and
pt_nulldref regions, too.

• deactivate_preg() is used to deactivate the pregion by removing
it from the active pregion list. If the agehand is pointing to the
pregion being deactivated and stealhand is pointing to the next
region in the active pregion list, the agehand is moved back one
pregion to prevent the agehand from exceeding the stealhand in
sequence. Otherwise, if the agehand or stealhand is pointing to the
pregion being deactivated, both hands are moved forward one
pregion.

• If the region is type rt_private or the pregion being discarded is
the last attached, its resources must be freed up.

• wait_for_io() awaits completion of any pending I/O to the
region (that is, r_poip = 0), so that no I/O request returns to
modify a page now assigned a different purpose.

• The region's B-tree is traversed to delete all the virtual address
translations. (That is, for each valid vfd, the TLBs are purged,
the cache flushed, and the pde entry invalidated: set space to -1,
address to 0, pfn to 0, valid to 0, ref to 0, and clear the bit from
pde_os.)

• If the pde is not the htbl entry, the pde is moved from hash list to
free list. If it is the htbl pde and it is unused, an effort is made to
fill it with a translation down its linked list, and then free the copied
pde.

• The physical-to-virtual translation is removed from
pfn_to_virt_table. If it was the last virtual translation for this
physical page, the hdlpf_trans is cleared in the pfdat entry.

• The pregion pointer is removed from the r_pregs list and the
memory used by the pregion is freed (that is, returned to its kernel
memory bucket).

• The region's r_incore and r_refcnt elements are decremented. If
r_refcnt equals zero, the region is freed also.

• Again, r_poip must decrement to zero before a region can be
freed, to prevent any unexpected I/O to its pages.

• The B-tree is walked again, and for each valid page found,
r_nvalid and pf_use are decremented in the pfdat entry. If
the physical page is not aliased, its pf_use will now be 0; it can be
freed for other uses.

• Its p_queue flag is set and the page is put on the pfdat free list
(phead). The kernel global freemem is incremented. If any other
processes are waiting for memory, we wake them all up so that the
first one here can have the page (the losers of the race will go to
sleep again).

• If r_bstore is swapdev_vp, the reserved swap pages (r_swalloc)
are released, as are the swap pages reserved for the B-tree
structure (r_root->b_rpages).

• The pages themselves are freed by invalidating their pdes, purging
the TLBs, flushing the caches, moving the non-HTBL pdes from the
hash list to the free list, and linking the pfdat entry into phead.

• r_root and r_chunk region elements are moved back to the buckets 
rather than being freed. 

• activeregions is decremented; the region is removed from the 
r_forw / r_back region chain, and the region memory returned to 
its memory allocation bucket. 
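
The reference counting that drives this teardown can be sketched as
follows. This is a simplified Python model under stated assumptions:
the region is a dictionary, "valid_pfns" is a hypothetical stand-in
for walking the B-tree, pending I/O raises an error instead of
sleeping, and translation, swap, and memory-bucket handling are
omitted.

```python
def free_region_pages(region, pf_use, phead, counters):
    """Drop a region's references to its valid pages; free unshared ones."""
    for pfn in region["valid_pfns"]:
        region["r_nvalid"] -= 1
        pf_use[pfn] -= 1
        if pf_use[pfn] == 0:
            phead.append(pfn)            # back on the pfdat free list
            counters["freemem"] += 1
    region["valid_pfns"] = []


def detach_pregion(region, pf_use, phead, counters):
    """Detach one pregion; free the region when no pregion uses it."""
    region["r_incore"] -= 1
    region["r_refcnt"] -= 1
    if region["r_refcnt"] != 0:
        return False
    if region["r_poip"] != 0:
        # The kernel waits for pending I/O; this sketch just refuses.
        raise RuntimeError("I/O still pending on region")
    free_region_pages(region, pf_use, phead, counters)
    counters["activeregions"] -= 1
    return True


pf_use = {10: 1, 11: 2}      # frame 11 is also used by another region
region = {"r_incore": 1, "r_refcnt": 1, "r_poip": 0,
          "r_nvalid": 2, "valid_pfns": [10, 11]}
phead, counters = [], {"freemem": 0, "activeregions": 3}
freed = detach_pregion(region, pf_use, phead, counters)
print(freed, phead, counters)
```

Only the frame whose use count drops to zero returns to the free
list; the aliased frame stays in memory for the other region.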

Building the new process 

If the process for which memory structures are being created is the first
to use the a.out as an executable, the a.out vnode's v_vas is NULL,
and requires creating the pseudo-vas, pseudo-pregion, and region.
Otherwise, the pseudo-vas' reference count is updated.

• To what region a pt_text pregion is attached depends on the type
of executable.

• If the executable is non-EXEC_MAGIC, a pt_text pregion is
attached to the pseudo-vas region.

• If the executable is EXEC_MAGIC, va_wrtext is set in the process
vas, the pseudo-vas' region is duplicated as a type rt_private
region (performing all the steps discussed for an rt_private
region), rf_swlazywrt is set in the new region so that no swap is
reserved before needed, and a pt_text pregion is attached to it.

• In both cases, a new space is attached to the pregion's virtual
address.

• A pt_nulldref pregion is attached to the global region
(globalnullrp), using the same space as pt_text.

• The pseudo-vas' region is duplicated as a type rt_private
region using r_off to point to the beginning of the data portion of
the a.out file. A pt_data pregion is attached to it. If this is
an EXEC_MAGIC executable, we use the pt_text pregion's
space, otherwise a new space is assigned.

• The pt_data pregion is incremented by the size of bss
(uninitialized data area), using dbd type dbd_dzero. This sets
b_protoidx to the end of the initialized data area and
b_proto2 to dbd_dzero. More swap is reserved.

• A private region of three pages (ssize + 1) is created for the user
stack. The dbd proto value is set to DBD_DZERO, and a pt_stack
pregion is attached at USRSTACK. The PT_DATA pregion's
space is used.

• When a shared library is linked to the process, two pt_mmap
pregions are created: an rt_shared pregion containing text
mapped into the third quadrant with a space of kernelspace and
an rt_private pregion containing associated data (such as
library global variables) with the pt_data pregion's space.

• If va_wrtext is set, the data pregion takes the first available
address above where the text ends (in the first or second
quadrant); otherwise it is assigned the first available address
above 0x40000000 (the second quadrant).
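
The space assignments in the list above can be summarized in a short
sketch. The pregion type names come from the text; the
assign_spaces() function and the counter used to hand out space IDs
are hypothetical, and shared-library pregions are left out.

```python
from itertools import count


def assign_spaces(exec_magic, new_space):
    """Map each pregion type to the space ID it is given at exec time."""
    text_space = new_space()         # pt_text always gets a new space
    # EXEC_MAGIC data reuses the text space; otherwise it gets its own.
    data_space = text_space if exec_magic else new_space()
    return {
        "pt_text": text_space,
        "pt_nulldref": text_space,   # same space as pt_text
        "pt_data": data_space,
        "pt_stack": data_space,      # uses the pt_data pregion's space
    }


ids = count(100)                     # hypothetical space ID allocator
shared_text = assign_spaces(False, lambda: next(ids))
writable_text = assign_spaces(True, lambda: next(ids))
print(shared_text)
print(writable_text)
```

For an EXEC_MAGIC executable all four pregions end up in one space,
while a non-EXEC_MAGIC executable keeps text and data in separate
spaces.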

Virtual memory and exit()

From the virtual memory perspective, an exit() resembles the first
part of an exec(). All virtual memory resources associated with the
process are discarded, but no new ones are allocated.

Thus, when exiting from a vfork child before the child has performed
an exec(), nothing needs to be cleaned up from virtual memory except
to return resources to the parent process. If exiting from a non-vfork
child, the virtual memory resources are discarded by calling
dispreg().
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